Mandatory elements:
Data analysis:
Checking the number of NaN values, report
Check whether the dataset contains missing values. Prepare a separate summary focused on finding missing values in the dataset: this will let you answer which variables contain gaps, how many there are, what they may stem from, and so on.
This step (together with the previous one) will help you find errors in the data: missing values, misinterpreted variable types, etc. It will also hint at which attributes to select for the analysis (in terms of their relevance to the model's predictions), whether and how to fill in missing data (or drop rows/columns containing too many gaps), how to transform them, and so on.
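As a minimal sketch of such a missing-value summary (assuming the data is already loaded into a pandas DataFrame; the toy frame below stands in for the real dataset):

```python
import pandas as pd
import numpy as np

# toy frame standing in for the real dataset
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x', 'y', None]})

# count and percentage of NaN per column, sorted descending
na_summary = (df.isna().sum()
                .to_frame('n_missing')
                .assign(pct_missing=lambda t: (t['n_missing'] / len(df) * 100).round(1))
                .sort_values('n_missing', ascending=False))
print(na_summary)
```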
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
import textwrap
import geopandas as gpd
import json
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
%matplotlib inline
pip install -U kaleido # to render plots as static .svg instead of interactive ones
Requirement already satisfied: kaleido in /usr/local/lib/python3.10/dist-packages (0.2.1)
import plotly.io as pio
gdf = gpd.read_file("/content/drive/MyDrive/Analiza danych python/Ćwiczenia/AirBnB/Barcelona/neighbourhoods.geojson")
gdf = gdf.reset_index()
print(gdf.shape)
gdf.head(1)
(75, 4)
| | index | neighbourhood | neighbourhood_group | geometry |
|---|---|---|---|---|
| 0 | 0 | el Raval | Ciutat Vella | MULTIPOLYGON (((2.17737 41.37532, 2.17852 41.3... |
geojson = json.loads(gdf.to_json())
fig = px.choropleth_mapbox(gdf, geojson=geojson,
locations=gdf.index,
color='neighbourhood_group',
hover_name='neighbourhood',
hover_data={'neighbourhood': False, 'neighbourhood_group': False, 'index': False},
featureidkey='properties.index',
mapbox_style='carto-positron',
center={'lat': 41.3933, 'lon': 2.1734},
zoom=11,
opacity=0.5)
fig.update_layout(
height=800, template='plotly_dark', bargap=0.05,
title='Map of Barcelona neighbourhoods')
fig.show()
df = pd.read_csv('/content/drive/MyDrive/Analiza danych python/Ćwiczenia/AirBnB/Barcelona/listings.csv.gz')
print(f'Dimensions: {df.shape}\n')
df.head(1)
Dimensions: (18086, 75)
| | id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18674 | https://www.airbnb.com/rooms/18674 | 20230906022853 | 2023-09-06 | city scrape | Rental unit in Barcelona · ★4.30 · 3 bedrooms ... | 110m2 apartment to rent in Barcelona. Located ... | Apartment in Barcelona located in the heart of... | https://a0.muscache.com/pictures/13031453/413c... | 71615 | ... | 4.65 | 4.78 | 4.27 | HUTB-002062 | t | 30 | 30 | 0 | 0 | 0.3 |
1 rows × 75 columns
NaN and data types

Conclusions from this step:
- several columns need conversion from object to float
- several from object to bool
- several from object to datetime
- host_response_rate and host_acceptance_rate need fixing from object to float
- the bathrooms column is redundant; its information is stored as text in bathrooms_text
- calendar_updated is completely empty and can be dropped
- neighbourhood can be dropped; neighbourhood_cleansed has better values (based on coordinates)
- amenities could be transformed into several one-hot columns
- host_response_time can be encoded with ordinal encoding

| id | listing_url | scrape_id | last_scraped | source |
|---|---|---|---|---|
| name | description | neighborhood_overview | picture_url | host_id |
| host_url | host_name | host_since | host_location | host_about |
| host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url |
| host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications |
| host_has_profile_pic | host_identity_verified | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed |
| latitude | longitude | property_type | room_type | accommodates |
| bathrooms | bathrooms_text | bedrooms | beds | amenities |
| price | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights |
| minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | calendar_updated |
| has_availability | availability_30 | availability_60 | availability_90 | availability_365 |
| calendar_last_scraped | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | first_review |
| last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin |
| review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable |
| calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
print(pd.DataFrame(np.array(df.columns.tolist()).reshape(15,5), columns=['', '', '', '', '']).to_markdown(index=False))
length = df.shape[0]
print(f'Total records: {length}')
for i in np.array(df.columns.tolist()).reshape(15, 5):
    hd = df[i].head(3)
    nans = [df[j].isna().sum() for j in hd.columns]
    nans_perc = [round(df[j].isna().sum() / length, 3) * 100 for j in hd.columns]
    uniques = [df[j].nunique() for j in hd.columns]
    typ = [df[df[j].notnull()][j].dtype for j in hd.columns]
    hd.loc[len(hd)] = typ
    hd.loc[len(hd)] = nans
    hd.loc[len(hd)] = nans_perc
    hd.loc[len(hd)] = uniques
    hd.loc[len(hd)-3] = hd.loc[len(hd)-3].astype('int')
    hd.loc[len(hd)-1] = hd.loc[len(hd)-1].astype('int')
    hd = hd.rename(index={len(hd)-4: 'Data type',
                          len(hd)-3: 'Total NaN',
                          len(hd)-2: '% NaN',
                          len(hd)-1: 'Unique values'})
    display(hd)
    print(' ')
Total records: 18086
| | id | listing_url | scrape_id | last_scraped | source |
|---|---|---|---|---|---|
| 0 | 18674 | https://www.airbnb.com/rooms/18674 | 20230906022853 | 2023-09-06 | city scrape |
| 1 | 23197 | https://www.airbnb.com/rooms/23197 | 20230906022853 | 2023-09-06 | city scrape |
| 2 | 117010 | https://www.airbnb.com/rooms/117010 | 20230906022853 | 2023-09-06 | previous scrape |
| Data type | int64 | object | int64 | object | object |
| Total NaN | 0 | 0 | 0 | 0 | 0 |
| % NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Unique values | 18086 | 18086 | 1 | 1 | 2 |
| | name | description | neighborhood_overview | picture_url | host_id |
|---|---|---|---|---|---|
| 0 | Rental unit in Barcelona · ★4.30 · 3 bedrooms ... | 110m2 apartment to rent in Barcelona. Located ... | Apartment in Barcelona located in the heart of... | https://a0.muscache.com/pictures/13031453/413c... | 71615 |
| 1 | Rental unit in Sant Adria de Besos · ★4.77 · 3... | Beautiful spacious apartment, large terrace, 5... | Strategically located in the Parc del Fòrum, a... | https://a0.muscache.com/pictures/miso/Hosting-... | 90417 |
| 2 | Rental unit in Barcelona · ★4.55 · 3 bedrooms ... | Have an authentic Barcelona experience! Modern... | Apartment is located just 240 meters from Sagr... | https://a0.muscache.com/pictures/4052d8b5-7352... | 567180 |
| Data type | object | object | object | object | int64 |
| Total NaN | 0 | 118 | 7978 | 0 | 0 |
| % NaN | 0.0 | 0.7 | 44.1 | 0.0 | 0.0 |
| Unique values | 6237 | 16445 | 6967 | 17751 | 6969 |
| | host_url | host_name | host_since | host_location | host_about |
|---|---|---|---|---|---|
| 0 | https://www.airbnb.com/users/show/71615 | Mireia And Maria | 2010-01-19 | Barcelona, Spain | We are Mireia (47) & Maria (49), two multiling... |
| 1 | https://www.airbnb.com/users/show/90417 | Etain (Marnie) | 2010-03-09 | Catalonia, Spain | Hi there, I'm marnie from Australia, though I ... |
| 2 | https://www.airbnb.com/users/show/567180 | Sandra | 2011-05-08 | Barcelona, Spain | Suite Home Sagrada Familia is a complex of ap... |
| Data type | object | object | object | object | object |
| Total NaN | 0 | 2 | 2 | 4058 | 6698 |
| % NaN | 0.0 | 0.0 | 0.0 | 22.4 | 37.0 |
| Unique values | 6969 | 3363 | 3284 | 403 | 3560 |
| | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url |
|---|---|---|---|---|---|
| 0 | within an hour | 97% | 90% | f | https://a0.muscache.com/im/users/71615/profile... |
| 1 | within an hour | 100% | 94% | t | https://a0.muscache.com/im/pictures/user/44b56... |
| 2 | within a few hours | 88% | 98% | f | https://a0.muscache.com/im/pictures/user/7aaeb... |
| Data type | object | object | object | object | object |
| Total NaN | 2963 | 2963 | 2685 | 419 | 2 |
| % NaN | 16.4 | 16.4 | 14.8 | 2.3 | 0.0 |
| Unique values | 4 | 69 | 97 | 2 | 6804 |
| | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications |
|---|---|---|---|---|---|
| 0 | https://a0.muscache.com/im/users/71615/profile... | la Sagrada Família | 47.0 | 48.0 | ['email', 'phone'] |
| 1 | https://a0.muscache.com/im/pictures/user/44b56... | El Besòs i el Maresme | 6.0 | 9.0 | ['email', 'phone'] |
| 2 | https://a0.muscache.com/im/pictures/user/7aaeb... | la Sagrada Familia | 19.0 | 19.0 | ['email', 'phone'] |
| Data type | object | object | float64 | float64 | object |
| Total NaN | 2 | 8545 | 2 | 2 | 0 |
| % NaN | 0.0 | 47.2 | 0.0 | 0.0 | 0.0 |
| Unique values | 6804 | 152 | 94 | 128 | 8 |
| | host_has_profile_pic | host_identity_verified | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed |
|---|---|---|---|---|---|
| 0 | t | t | Barcelona, CT, Spain | la Sagrada Família | Eixample |
| 1 | t | t | Sant Adria de Besos, Barcelona, Spain | el Besòs i el Maresme | Sant Martí |
| 2 | t | f | Barcelona, Catalonia, Spain | la Sagrada Família | Eixample |
| Data type | object | object | object | object | object |
| Total NaN | 2 | 2 | 7978 | 0 | 0 |
| % NaN | 0.0 | 0.0 | 44.1 | 0.0 | 0.0 |
| Unique values | 2 | 2 | 63 | 70 | 10 |
| | latitude | longitude | property_type | room_type | accommodates |
|---|---|---|---|---|---|
| 0 | 41.40556 | 2.17262 | Entire rental unit | Entire home/apt | 8 |
| 1 | 41.412432 | 2.21975 | Entire rental unit | Entire home/apt | 5 |
| 2 | 41.40647 | 2.17457 | Entire rental unit | Entire home/apt | 8 |
| Data type | float64 | float64 | object | object | int64 |
| Total NaN | 0 | 0 | 0 | 0 | 0 |
| % NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Unique values | 8878 | 9741 | 61 | 4 | 16 |
| | bathrooms | bathrooms_text | bedrooms | beds | amenities |
|---|---|---|---|---|---|
| 0 | NaN | 2 baths | 3.0 | 6.0 | ["Refrigerator", "30\" TV", "Iron", "Hangers",... |
| 1 | NaN | 2 baths | 3.0 | 4.0 | ["Refrigerator", "Toaster", "Cleaning products... |
| 2 | NaN | 2 baths | 3.0 | 6.0 | ["Refrigerator", "Clothing storage: closet and... |
| Data type | float64 | object | float64 | float64 | object |
| Total NaN | 18086 | 9 | 6360 | 314 | 0 |
| % NaN | 100.0 | 0.0 | 35.2 | 1.7 | 0.0 |
| Unique values | 0 | 34 | 12 | 21 | 15556 |
| | price | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights |
|---|---|---|---|---|---|
| 0 | $202.00 | 1 | 1125 | 1 | 4 |
| 1 | $255.00 | 3 | 300 | 3 | 5 |
| 2 | $331.00 | 2 | 30 | 2 | 3 |
| Data type | object | int64 | int64 | int64 | int64 |
| Total NaN | 0 | 0 | 0 | 0 | 0 |
| % NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Unique values | 698 | 86 | 204 | 88 | 98 |
| | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | calendar_updated |
|---|---|---|---|---|---|
| 0 | 1125 | 1125 | 2.2 | 1125.0 | NaN |
| 1 | 1125 | 1125 | 3.2 | 1125.0 | NaN |
| 2 | 30 | 32 | 2.0 | 30.3 | NaN |
| Data type | int64 | int64 | float64 | float64 | float64 |
| Total NaN | 0 | 0 | 0 | 0 | 18086 |
| % NaN | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 |
| Unique values | 192 | 189 | 412 | 817 | 0 |
| | has_availability | availability_30 | availability_60 | availability_90 | availability_365 |
|---|---|---|---|---|---|
| 0 | t | 4 | 11 | 21 | 34 |
| 1 | t | 16 | 31 | 61 | 150 |
| 2 | f | 0 | 0 | 0 | 0 |
| Data type | object | int64 | int64 | int64 | int64 |
| Total NaN | 0 | 0 | 0 | 0 | 0 |
| % NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Unique values | 2 | 31 | 61 | 91 | 366 |
| | calendar_last_scraped | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | first_review |
|---|---|---|---|---|---|
| 0 | 2023-09-06 | 38 | 8 | 0 | 2013-05-27 |
| 1 | 2023-09-06 | 73 | 11 | 1 | 2011-03-15 |
| 2 | 2023-09-06 | 48 | 6 | 1 | 2011-08-09 |
| Data type | object | int64 | int64 | int64 | object |
| Total NaN | 0 | 0 | 0 | 0 | 4466 |
| % NaN | 0.0 | 0.0 | 0.0 | 0.0 | 24.7 |
| Unique values | 1 | 532 | 179 | 32 | 3355 |
| | last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin |
|---|---|---|---|---|---|
| 0 | 2023-06-26 | 4.3 | 4.41 | 4.62 | 4.76 |
| 1 | 2023-08-15 | 4.77 | 4.93 | 4.89 | 4.94 |
| 2 | 2023-08-13 | 4.55 | 4.59 | 4.57 | 4.82 |
| Data type | object | float64 | float64 | float64 | float64 |
| Total NaN | 4466 | 4466 | 4549 | 4548 | 4553 |
| % NaN | 24.7 | 24.7 | 25.2 | 25.1 | 25.2 |
| Unique values | 1658 | 180 | 174 | 186 | 172 |
| | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable |
|---|---|---|---|---|---|
| 0 | 4.65 | 4.78 | 4.27 | HUTB-002062 | t |
| 1 | 4.99 | 4.6 | 4.64 | HUTB005057 | f |
| 2 | 4.91 | 4.86 | 4.59 | HUTB000331 | f |
| Data type | float64 | float64 | float64 | object | object |
| Total NaN | 4547 | 4552 | 4553 | 5764 | 0 |
| % NaN | 25.1 | 25.2 | 25.2 | 31.9 | 0.0 |
| Unique values | 158 | 154 | 192 | 5635 | 2 |
| | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|
| 0 | 30 | 30 | 0 | 0 | 0.3 |
| 1 | 2 | 2 | 0 | 0 | 0.48 |
| 2 | 19 | 19 | 0 | 0 | 0.33 |
| Data type | int64 | int64 | int64 | int64 | float64 |
| Total NaN | 0 | 0 | 0 | 0 | 4466 |
| % NaN | 0.0 | 0.0 | 0.0 | 0.0 | 24.7 |
| Unique values | 66 | 62 | 30 | 10 | 782 |
price

def fix_price(price):
    price = price.split('$')[1]
    if ',' in price:
        price = price.replace(',', '')
    return float(price)
fix_price('$1,000')
1000.0
df['price'] = df['price'].apply(lambda x: fix_price(x))
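The same conversion can also be done vectorized with pandas string methods; a sketch on a toy Series (not the notebook's df):

```python
import pandas as pd

prices = pd.Series(['$202.00', '$1,000.00', '$8.00'])
# strip the leading '$', drop thousands separators, cast to float
prices_num = prices.str.lstrip('$').str.replace(',', '', regex=False).astype(float)
print(prices_num.tolist())
```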
bool_fix = {
    'f': 0,
    't': 1
}
# map 'f'/'t' to 0/1 everywhere; values not in the dict are left unchanged
df = df.applymap(lambda x: bool_fix.get(x, x))
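Note that `DataFrame.applymap` is deprecated in recent pandas (renamed to `DataFrame.map` in 2.1); the same t/f mapping can also be written with `replace`, sketched here on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'instant_bookable': ['t', 'f', 't'], 'price': [100, 80, 120]})
bool_fix = {'f': 0, 't': 1}
# replace only touches matching values; all other cells are left as-is
toy = toy.replace(bool_fix)
print(toy['instant_bookable'].tolist())
```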
def fix_rates(rate):
    if isinstance(rate, str):
        rate = rate.split('%')[0]
        return int(rate) / 100
    else:
        return rate
df['host_response_rate'] = df['host_response_rate'].apply(lambda x: fix_rates(x))
df['host_acceptance_rate'] = df['host_acceptance_rate'].apply(lambda x: fix_rates(x))
df['bathrooms_text'].unique()
array(['2 baths', '1.5 baths', '2.5 baths', '3 baths', '1 shared bath',
'1 bath', '1 private bath', '3.5 baths', '4 baths',
'1.5 shared baths', nan, '2 shared baths', '2.5 shared baths',
'5.5 baths', '7.5 baths', '4.5 baths', '6 baths', '0 shared baths',
'Half-bath', 'Private half-bath', '0 baths', '5 baths', '8 baths',
'3 shared baths', '8 shared baths', '4 shared baths',
'Shared half-bath', '5 shared baths', '3.5 shared baths',
'6 shared baths', '5.5 shared baths', '10 baths',
'10 shared baths', '6.5 baths', '4.5 shared baths'], dtype=object)
An idea for a solution:
- one column shared_bathrooms, a second private_bathrooms, plus a general column that does not distinguish between private and shared bathrooms
- nan, which is of type float, will mean 0 bathrooms

def fix_private_bathrooms(bathtext):
    if isinstance(bathtext, float):  # handles nan
        return 0
    if isinstance(bathtext, str):
        if 'shared' in bathtext.lower():
            return 0
        else:
            n = bathtext.split(' ')[0]
            try:  # is there a number at all, or e.g. just 'Half-bath'
                return float(n)
            except ValueError:
                return 1
    else:
        return bathtext  # returns None to fill in later
def fix_shared_bathrooms(bathtext):
if isinstance(bathtext, float):
return 0
if isinstance(bathtext, str):
if 'shared' not in bathtext.lower():
return 0
else:
n = bathtext.split(' ')[0]
try:
return float(n)
except ValueError:
return 1
else:
return bathtext
def total_baths(bathtext):  # no listing has shared and private baths at once; this just ignores the distinction
if isinstance(bathtext, float):
return 0
if isinstance(bathtext, str):
n = bathtext.split(' ')[0]
try:
return float(n)
except ValueError:
return 1
else:
return bathtext
# TEST fix_private_bathrooms, fix_shared_bathrooms, total_baths
for i in df['bathrooms_text'].unique():
print(i, type(i), fix_shared_bathrooms(i))
2 baths <class 'str'> 0
1.5 baths <class 'str'> 0
2.5 baths <class 'str'> 0
3 baths <class 'str'> 0
1 shared bath <class 'str'> 1.0
1 bath <class 'str'> 0
1 private bath <class 'str'> 0
3.5 baths <class 'str'> 0
4 baths <class 'str'> 0
1.5 shared baths <class 'str'> 1.5
nan <class 'float'> 0
2 shared baths <class 'str'> 2.0
2.5 shared baths <class 'str'> 2.5
5.5 baths <class 'str'> 0
7.5 baths <class 'str'> 0
4.5 baths <class 'str'> 0
6 baths <class 'str'> 0
0 shared baths <class 'str'> 0.0
Half-bath <class 'str'> 0
Private half-bath <class 'str'> 0
0 baths <class 'str'> 0
5 baths <class 'str'> 0
8 baths <class 'str'> 0
3 shared baths <class 'str'> 3.0
8 shared baths <class 'str'> 8.0
4 shared baths <class 'str'> 4.0
Shared half-bath <class 'str'> 1
5 shared baths <class 'str'> 5.0
3.5 shared baths <class 'str'> 3.5
6 shared baths <class 'str'> 6.0
5.5 shared baths <class 'str'> 5.5
10 baths <class 'str'> 0
10 shared baths <class 'str'> 10.0
6.5 baths <class 'str'> 0
4.5 shared baths <class 'str'> 4.5
df['private_bathrooms'] = df['bathrooms_text'].apply(lambda x: fix_private_bathrooms(x))
df['shared_bathrooms'] = df['bathrooms_text'].apply(lambda x: fix_shared_bathrooms(x))
df['total_bathrooms'] = df['bathrooms_text'].apply(lambda x: total_baths(x))
df[['private_bathrooms', 'shared_bathrooms', 'total_bathrooms']].head()
| | private_bathrooms | shared_bathrooms | total_bathrooms |
|---|---|---|---|
| 0 | 2.0 | 0.0 | 2.0 |
| 1 | 2.0 | 0.0 | 2.0 |
| 2 | 2.0 | 0.0 | 2.0 |
| 3 | 1.5 | 0.0 | 1.5 |
| 4 | 2.5 | 0.0 | 2.5 |
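The three apply-based parsers above could also be sketched vectorized with string methods: `str.extract` pulls the leading number and a 'shared' mask splits it into the two columns. A toy Series is used here; as in the original functions, text without a leading digit (e.g. 'Half-bath') counts as 1 and missing text as 0:

```python
import pandas as pd

bt = pd.Series(['2 baths', '1 shared bath', 'Half-bath', None])
# leading number, if any ('1.5 baths' -> 1.5)
num = bt.str.extract(r'^(\d+(?:\.\d+)?)')[0].astype(float)
# no leading digit (e.g. 'Half-bath') counts as 1, missing text as 0
num = num.fillna(bt.notna().astype(float))
shared = bt.str.contains('shared', case=False, na=False)
private_baths = num.where(~shared, 0.0)
shared_baths = num.where(shared, 0.0)
print(private_baths.tolist(), shared_baths.tolist())
```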
from datetime import datetime
datetime_colnames = ['last_scraped', 'host_since', 'calendar_last_scraped', 'first_review', 'last_review']
df[datetime_colnames].head(3)
| | last_scraped | host_since | calendar_last_scraped | first_review | last_review |
|---|---|---|---|---|---|
| 0 | 2023-09-06 | 2010-01-19 | 2023-09-06 | 2013-05-27 | 2023-06-26 |
| 1 | 2023-09-06 | 2010-03-09 | 2023-09-06 | 2011-03-15 | 2023-08-15 |
| 2 | 2023-09-06 | 2011-05-08 | 2023-09-06 | 2011-08-09 | 2023-08-13 |
test = df[datetime_colnames].head(3).loc[0][0]
print(test, type(test))
fixed = datetime.strptime(test, '%Y-%m-%d').date()
print(fixed, type(fixed))
2023-09-06 <class 'str'>
2023-09-06 <class 'datetime.date'>
for i in datetime_colnames:
    df[i] = df[i].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').date() if isinstance(x, str) else x)
print(df['last_scraped'][0])
type(df['last_scraped'][0])
2023-09-06
datetime.date
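The per-element strptime loop above can usually be replaced with `pd.to_datetime`, which vectorizes the parse and passes missing values through as NaT (sketch on a toy Series):

```python
import pandas as pd

s = pd.Series(['2023-09-06', '2010-01-19', None])
dates = pd.to_datetime(s, format='%Y-%m-%d')  # None becomes NaT
print(dates.dtype)
```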
Amenities worth including:
df['amenities'].notnull().sum()
18086
# # preview a few examples
# for i in range(10):
#     display(df['amenities'][i].lower())
amenities = {'refrigerator': ['refrigerator'],
'tv' : ['tv', 'television'],
'iron': ['iron'],
'crib': ['crib'],
'parking': ['parking'],
'elevator': ['elevator'],
'AC/heating' : ['ac', 'air', 'conditioning', 'heat'],  # note: short tokens like 'ac' are matched as substrings and may over-match
'wifi': ['wifi', 'internet'],
'kitchen' : ['kitchen', 'microwave', 'stove', 'oven', 'dishwasher', 'toaster',],
'washer' : ['wash']}
for key in amenities:
print(f'Key: {key}, Value: {amenities[key]}')
Key: refrigerator, Value: ['refrigerator']
Key: tv, Value: ['tv', 'television']
Key: iron, Value: ['iron']
Key: crib, Value: ['crib']
Key: parking, Value: ['parking']
Key: elevator, Value: ['elevator']
Key: AC/heating, Value: ['ac', 'air', 'conditioning', 'heat']
Key: wifi, Value: ['wifi', 'internet']
Key: kitchen, Value: ['kitchen', 'microwave', 'stove', 'oven', 'dishwasher', 'toaster']
Key: washer, Value: ['wash']
def check_amenities(text, key):
text = text.lower()
return any(i in text for i in amenities[key])
for key in amenities:
df[key] = df['amenities'].apply(lambda x: check_amenities(x, key))
df['washer'].value_counts()
True     15077
False     3009
Name: washer, dtype: int64
df[[i for i in amenities]].tail(5)
| | refrigerator | tv | iron | crib | parking | elevator | AC/heating | wifi | kitchen | washer |
|---|---|---|---|---|---|---|---|---|---|---|
| 18081 | True | True | False | False | False | True | True | True | True | True |
| 18082 | True | True | True | True | True | False | True | True | True | True |
| 18083 | True | True | True | True | False | False | True | True | True | True |
| 18084 | False | True | False | False | True | False | True | True | True | True |
| 18085 | False | True | False | False | True | False | True | True | True | True |
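Each flag can also be computed vectorized with `str.contains` and a regex alternation instead of the Python-level `any(...)` loop (toy example; the word boundary `\btv\b` avoids matching e.g. '30" TV stand' fragments inside other words):

```python
import pandas as pd

am = pd.Series(['["Wifi", "Kitchen"]', '["TV", "Paid parking"]'])
patterns = {'wifi': 'wifi|internet', 'tv': r'\btv\b|television'}
# one boolean Series per amenity group
flags = {k: am.str.contains(p, case=False, regex=True) for k, p in patterns.items()}
print(flags['wifi'].tolist(), flags['tv'].tolist())
```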
host_response_time

df['host_response_time'].value_counts()
within an hour        10879
within a few hours     2393
within a day           1558
a few days or more      293
Name: host_response_time, dtype: int64
response_ordinal = {
'within an hour' : 1,
'within a few hours' : 2,
'within a day' : 3,
'a few days or more' : 4
}
df['host_response_time'] = df['host_response_time'].map(response_ordinal)
df['host_response_time'].unique()
array([ 1., 2., 3., nan, 4.])
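The NaN rows force the encoded column to float, as the array above shows; if integer codes are preferred, pandas' nullable Int64 dtype keeps the missing marker (a sketch on a toy Series):

```python
import pandas as pd

s = pd.Series(['within an hour', 'within a day', None])
response_ordinal = {'within an hour': 1, 'within a few hours': 2,
                    'within a day': 3, 'a few days or more': 4}
encoded = s.map(response_ordinal).astype('Int64')  # NaN stays as <NA>, codes stay integer
print(encoded.tolist())
```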
df.drop(columns=['bathrooms','bathrooms_text', 'calendar_updated'], inplace=True)
df.shape
(18086, 85)
Done so far:
- object converted to float, DateTime, bool
- new one-hot columns: refrigerator, tv, iron, crib, parking, elevator, AC/heating, wifi, kitchen, washer
- calendar_updated dropped

Notes:
- minimum_maximum_nights, maximum_maximum_nights and maximum_nights_avg_ntm have very high values
- price also has very high values in places
- exempt in the license column means the host does not need a license and may rent the place for 1-11 months.

cols = '''host_response_rate host_acceptance_rate host_listings_count host_total_listings_count accommodates total_bathrooms private_bathrooms shared_bathrooms beds price minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm availability_30 availability_60 availability_90 availability_365 number_of_reviews number_of_reviews_ltm number_of_reviews_l30d review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
'''
cols = cols.split()  # split() with no argument handles newlines and repeated spaces, and drops empty strings
print(cols)
print(len(cols))
['host_response_rate', 'host_acceptance_rate', 'host_listings_count', 'host_total_listings_count', 'accommodates', 'total_bathrooms', 'private_bathrooms', 'shared_bathrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'reviews_per_month']
37
step = 0
for i in range(7):
    display(df[cols[step: step+5]].describe().round(2)[1:])  # skip `count`, already known
    print('')
    step += 5
display(df[cols[35:]].describe().round(2)[1:])
| | host_response_rate | host_acceptance_rate | host_listings_count | host_total_listings_count | accommodates |
|---|---|---|---|---|---|
| mean | 0.94 | 0.87 | 41.13 | 53.55 | 3.36 |
| std | 0.14 | 0.23 | 87.47 | 109.17 | 2.14 |
| min | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| 25% | 0.96 | 0.86 | 2.00 | 2.00 | 2.00 |
| 50% | 1.00 | 0.99 | 6.00 | 9.00 | 3.00 |
| 75% | 1.00 | 1.00 | 34.00 | 48.00 | 4.00 |
| max | 1.00 | 1.00 | 786.00 | 1853.00 | 16.00 |
| | total_bathrooms | private_bathrooms | shared_bathrooms | beds | price |
|---|---|---|---|---|---|
| mean | 1.36 | 1.05 | 0.31 | 2.36 | 162.80 |
| std | 0.69 | 0.83 | 0.67 | 1.83 | 1057.27 |
| min | 0.00 | 0.00 | 0.00 | 1.00 | 8.00 |
| 25% | 1.00 | 1.00 | 0.00 | 1.00 | 52.00 |
| 50% | 1.00 | 1.00 | 0.00 | 2.00 | 100.00 |
| 75% | 2.00 | 1.50 | 0.00 | 3.00 | 185.00 |
| max | 10.00 | 10.00 | 10.00 | 30.00 | 90000.00 |
| | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights |
|---|---|---|---|---|---|
| mean | 14.76 | 574.35 | 14.54 | 18.89 | 2.381127e+05 |
| std | 32.36 | 448.48 | 33.69 | 46.54 | 2.258195e+07 |
| min | 1.00 | 1.00 | 1.00 | 1.00 | 1.000000e+00 |
| 25% | 1.00 | 180.25 | 1.00 | 3.00 | 3.000000e+02 |
| 50% | 3.00 | 365.00 | 2.00 | 5.00 | 3.650000e+02 |
| 75% | 31.00 | 1125.00 | 31.00 | 31.00 | 1.125000e+03 |
| max | 1125.00 | 3000.00 | 1125.00 | 2705.00 | 2.147484e+09 |
| | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | availability_30 | availability_60 |
|---|---|---|---|---|---|
| mean | 2.381551e+05 | 17.34 | 2.381370e+05 | 7.72 | 19.37 |
| std | 2.258195e+07 | 40.63 | 2.258195e+07 | 9.48 | 19.80 |
| min | 1.000000e+00 | 1.00 | 1.000000e+00 | 0.00 | 0.00 |
| 25% | 3.330000e+02 | 2.00 | 3.300000e+02 | 0.00 | 0.00 |
| 50% | 9.990000e+02 | 3.70 | 4.071000e+02 | 4.00 | 13.00 |
| 75% | 1.125000e+03 | 31.00 | 1.125000e+03 | 12.00 | 34.00 |
| max | 2.147484e+09 | 1125.00 | 2.147484e+09 | 30.00 | 60.00 |
| | availability_90 | availability_365 | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d |
|---|---|---|---|---|---|
| mean | 36.43 | 171.82 | 42.22 | 11.44 | 0.92 |
| std | 30.56 | 129.47 | 85.73 | 23.27 | 1.99 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 1.00 | 47.00 | 1.00 | 0.00 | 0.00 |
| 50% | 35.00 | 175.00 | 6.00 | 2.00 | 0.00 |
| 75% | 62.00 | 302.00 | 42.00 | 15.00 | 1.00 |
| max | 90.00 | 365.00 | 1817.00 | 836.00 | 91.00 |
| | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication |
|---|---|---|---|---|---|
| mean | 4.54 | 4.62 | 4.59 | 4.71 | 4.71 |
| std | 0.62 | 0.51 | 0.52 | 0.46 | 0.47 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 4.41 | 4.50 | 4.45 | 4.64 | 4.64 |
| 50% | 4.68 | 4.75 | 4.72 | 4.85 | 4.85 |
| 75% | 4.90 | 4.93 | 4.92 | 5.00 | 5.00 |
| max | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
| | review_scores_location | review_scores_value | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms |
|---|---|---|---|---|---|
| mean | 4.74 | 4.44 | 31.63 | 23.87 | 7.60 |
| std | 0.37 | 0.56 | 60.75 | 54.20 | 32.38 |
| min | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 25% | 4.67 | 4.28 | 1.00 | 0.00 | 0.00 |
| 50% | 4.83 | 4.56 | 5.00 | 2.00 | 0.00 |
| 75% | 5.00 | 4.78 | 28.00 | 19.00 | 2.00 |
| max | 5.00 | 5.00 | 294.00 | 294.00 | 233.00 |
| | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|
| mean | 0.08 | 1.44 |
| std | 0.74 | 1.85 |
| min | 0.00 | 0.01 |
| 25% | 0.00 | 0.24 |
| 50% | 0.00 | 0.85 |
| 75% | 0.00 | 2.10 |
| max | 12.00 | 55.02 |
minimum_maximum_nights, maximum_maximum_nights

The columns concerning the number of nights are described as follows:
| Field | Type | Calculated | Description |
|---|---|---|---|
| minimum_nights | integer | | minimum number of night stay for the listing (calendar rules may be different) |
| maximum_nights | integer | | maximum number of night stay for the listing (calendar rules may be different) |
| minimum_minimum_nights | integer | y | the smallest minimum_night value from the calendar (looking 365 nights in the future) |
| maximum_minimum_nights | integer | y | the largest minimum_night value from the calendar (looking 365 nights in the future) |
| minimum_maximum_nights | integer | y | the smallest maximum_night value from the calendar (looking 365 nights in the future) |
| maximum_maximum_nights | integer | y | the largest maximum_night value from the calendar (looking 365 nights in the future) |
| minimum_nights_avg_ntm | numeric | y | the average minimum_night value from the calendar (looking 365 nights in the future) |
| maximum_nights_avg_ntm | numeric | y | the average maximum_night value from the calendar (looking 365 nights in the future) |
df[df['minimum_maximum_nights'] >= 3000]['minimum_maximum_nights']
48            3000
744           3000
2463    2147483647
2540    2147483647
9162          9999
Name: minimum_maximum_nights, dtype: int64
nights = '''minimum_nights
maximum_nights
minimum_minimum_nights
maximum_minimum_nights
minimum_maximum_nights
maximum_maximum_nights
minimum_nights_avg_ntm
maximum_nights_avg_ntm'''
df.iloc[48][nights.split('\n')]
minimum_nights              28
maximum_nights            3000
minimum_minimum_nights      28
maximum_minimum_nights      28
minimum_maximum_nights    3000
maximum_maximum_nights    3000
minimum_nights_avg_ntm    28.0
maximum_nights_avg_ntm    3000.0
Name: 48, dtype: object
print(df.query('minimum_nights != minimum_minimum_nights').shape)
print(df.query('minimum_nights != maximum_minimum_nights').shape)
print(df.query('maximum_nights != minimum_maximum_nights').shape)
print(df.query('minimum_nights != maximum_maximum_nights').shape)  # NB: compares minimum_nights; 'maximum_nights != maximum_maximum_nights' may have been intended
(3545, 85)
(5126, 85)
(4153, 85)
(17991, 85)
df.query('minimum_nights != minimum_minimum_nights')[nights.split('\n')].head()
| | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm |
|---|---|---|---|---|---|---|---|---|
| 3 | 21 | 31 | 1 | 1 | 31 | 31 | 1.0 | 31.0 |
| 6 | 3 | 1125 | 2 | 6 | 1125 | 1125 | 3.3 | 1125.0 |
| 7 | 5 | 365 | 2 | 5 | 1125 | 1125 | 4.9 | 1125.0 |
| 14 | 2 | 364 | 1 | 4 | 364 | 364 | 2.0 | 364.0 |
| 20 | 1 | 1125 | 3 | 3 | 1125 | 1125 | 3.0 | 1125.0 |
display(df[df['minimum_maximum_nights'] < 3000]['minimum_maximum_nights'].describe().round(2))
print(f"Mode: {df[df['minimum_maximum_nights'] < 3000]['minimum_maximum_nights'].mode()}")
count    18081.00
mean       637.31
std        464.52
min          1.00
25%        300.00
50%        365.00
75%       1125.00
max       1125.00
Name: minimum_maximum_nights, dtype: float64
Mode: 0    1125
Name: minimum_maximum_nights, dtype: int64
df.loc[2463, 'minimum_maximum_nights'] = 365
df.loc[2540, 'minimum_maximum_nights'] = 365
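The 2147483647 values are the 32-bit signed integer maximum, i.e. a sentinel rather than a real limit. Instead of patching rows one by one, such sentinels can be swapped for the column median programmatically (a sketch on a toy Series; using the median, as the manual fix above uses 365, is an assumption):

```python
import pandas as pd

INT32_MAX = 2147483647
s = pd.Series([300, 365, INT32_MAX, 1125, INT32_MAX])
median_ok = s[s != INT32_MAX].median()   # median of the plausible values only
s_fixed = s.mask(s == INT32_MAX, median_ok)
print(s_fixed.tolist())
```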
fig = px.histogram(df[df['minimum_maximum_nights'] < 3000], x='minimum_maximum_nights', barmode='group')
fig.update_layout(
    width=500, height=500, template='plotly_dark', bargap=0.05,
    title='Minimum maximum nights below 3000',
    xaxis_title="Minimum maximum nights",
    yaxis_title="Count")
fig = px.histogram(df, x='maximum_nights', barmode='group')
fig.update_layout(width=500, height=500, template='plotly_dark', bargap=0.05,)
fig = px.histogram(df, x='minimum_maximum_nights', barmode='group')
fig.update_layout(width=500, height=500, template='plotly_dark', bargap=0.05,)
price and its strongly outlying values
df['price'].describe()[1:].round(2)
mean      162.80
std      1057.27
min         8.00
25%        52.00
50%       100.00
75%       185.00
max     90000.00
Name: price, dtype: float64
fig = px.histogram(df, x='price', barmode='group')
fig.update_layout(width=500, height=500, template='plotly_dark', bargap=0.05,)
prices = df[['listing_url','price']].sort_values(by='price')
prices
| listing_url | price | |
|---|---|---|
| 15499 | https://www.airbnb.com/rooms/871607319497888166 | 8.0 |
| 2940 | https://www.airbnb.com/rooms/13873341 | 9.0 |
| 3440 | https://www.airbnb.com/rooms/17100681 | 9.0 |
| 212 | https://www.airbnb.com/rooms/587203 | 9.0 |
| 4657 | https://www.airbnb.com/rooms/22218299 | 9.0 |
| ... | ... | ... |
| 8770 | https://www.airbnb.com/rooms/43479422 | 9200.0 |
| 7439 | https://www.airbnb.com/rooms/37486088 | 11000.0 |
| 5374 | https://www.airbnb.com/rooms/25927238 | 57731.0 |
| 9598 | https://www.airbnb.com/rooms/47725335 | 84999.0 |
| 3092 | https://www.airbnb.com/rooms/14484042 | 90000.0 |
18086 rows × 2 columns
prices.sample()
| listing_url | price | |
|---|---|---|
| 12133 | https://www.airbnb.com/rooms/639224424789847854 | 174.0 |
prices.query('price < 50').shape
(3984, 2)
prices.tail(30)
| listing_url | price | |
|---|---|---|
| 6220 | https://www.airbnb.com/rooms/31032688 | 2000.0 |
| 6051 | https://www.airbnb.com/rooms/29808581 | 2100.0 |
| 7389 | https://www.airbnb.com/rooms/36820054 | 2230.0 |
| 5984 | https://www.airbnb.com/rooms/29251355 | 2500.0 |
| 11704 | https://www.airbnb.com/rooms/588106770878222950 | 2500.0 |
| 5944 | https://www.airbnb.com/rooms/28874308 | 3000.0 |
| 5353 | https://www.airbnb.com/rooms/25758097 | 3000.0 |
| 9563 | https://www.airbnb.com/rooms/46731061 | 3528.0 |
| 8897 | https://www.airbnb.com/rooms/44000385 | 4000.0 |
| 8643 | https://www.airbnb.com/rooms/42893238 | 4124.0 |
| 16651 | https://www.airbnb.com/rooms/917928804879169822 | 4500.0 |
| 6366 | https://www.airbnb.com/rooms/32088715 | 5214.0 |
| 9102 | https://www.airbnb.com/rooms/45100047 | 5722.0 |
| 2952 | https://www.airbnb.com/rooms/14089552 | 6979.0 |
| 8639 | https://www.airbnb.com/rooms/42890622 | 7000.0 |
| 8772 | https://www.airbnb.com/rooms/43483528 | 8478.0 |
| 1878 | https://www.airbnb.com/rooms/7105437 | 8999.0 |
| 8782 | https://www.airbnb.com/rooms/43483589 | 9100.0 |
| 8779 | https://www.airbnb.com/rooms/43483560 | 9100.0 |
| 8778 | https://www.airbnb.com/rooms/43483536 | 9100.0 |
| 6775 | https://www.airbnb.com/rooms/33699122 | 9199.0 |
| 8764 | https://www.airbnb.com/rooms/43479394 | 9200.0 |
| 8766 | https://www.airbnb.com/rooms/43479417 | 9200.0 |
| 8768 | https://www.airbnb.com/rooms/43479419 | 9200.0 |
| 8771 | https://www.airbnb.com/rooms/43479425 | 9200.0 |
| 8770 | https://www.airbnb.com/rooms/43479422 | 9200.0 |
| 7439 | https://www.airbnb.com/rooms/37486088 | 11000.0 |
| 5374 | https://www.airbnb.com/rooms/25927238 | 57731.0 |
| 9598 | https://www.airbnb.com/rooms/47725335 | 84999.0 |
| 3092 | https://www.airbnb.com/rooms/14484042 | 90000.0 |
prices.query('price == 22').head(20)
| listing_url | price | |
|---|---|---|
| 11737 | https://www.airbnb.com/rooms/589115666385828682 | 22.0 |
| 3746 | https://www.airbnb.com/rooms/18420309 | 22.0 |
| 3557 | https://www.airbnb.com/rooms/17468995 | 22.0 |
| 6554 | https://www.airbnb.com/rooms/32736627 | 22.0 |
| 11780 | https://www.airbnb.com/rooms/607907530110718284 | 22.0 |
| 6221 | https://www.airbnb.com/rooms/31203362 | 22.0 |
| 6583 | https://www.airbnb.com/rooms/33211494 | 22.0 |
| 16787 | https://www.airbnb.com/rooms/922364898996026177 | 22.0 |
| 7164 | https://www.airbnb.com/rooms/36374867 | 22.0 |
| 16093 | https://www.airbnb.com/rooms/895158128888394391 | 22.0 |
| 3863 | https://www.airbnb.com/rooms/18697011 | 22.0 |
| 17866 | https://www.airbnb.com/rooms/964279024581469247 | 22.0 |
| 3082 | https://www.airbnb.com/rooms/15013369 | 22.0 |
| 3084 | https://www.airbnb.com/rooms/14299410 | 22.0 |
| 6586 | https://www.airbnb.com/rooms/32832065 | 22.0 |
| 4408 | https://www.airbnb.com/rooms/21366324 | 22.0 |
| 9897 | https://www.airbnb.com/rooms/49056712 | 22.0 |
| 9863 | https://www.airbnb.com/rooms/48866420 | 22.0 |
| 16221 | https://www.airbnb.com/rooms/902291658003414702 | 22.0 |
| 7213 | https://www.airbnb.com/rooms/36128368 | 22.0 |
I checked some of the apartments and the prices are absurd (also considering what is actually on offer). Some very high-priced listings, e.g.:
| listing_url | price | |
|---|---|---|
| 4159 | https://www.airbnb.com/rooms/20472746 | 1533.0 |
seem (relatively 😅) justified. I would set the upper price cut-off at $2000. That is the last reasonable-looking listing; beyond it there are almost exclusively erroneous offers and expired ones that can no longer be verified.
Very low-priced listings, in turn, are mostly unavailable (the price cannot be verified), wrongly entered, or stale. E.g.:
| listing_url | price | |
|---|---|---|
| 212 | https://www.airbnb.com/rooms/587203 | 9.0 |
is actually listed as 2778 zł per month / 30 / 5 ≈ $19
After checking listings in the lower range, I have to conclude that $20 would be a safer threshold.
So after trimming the table to listings from $20 to $2000, 17783 records remain, i.e. just under 2% of records were removed.
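The hand-picked $20 and $2000 cut-offs can be cross-checked against a standard 1.5 x IQR fence (a sketch on synthetic prices; on heavy-tailed data like this the IQR fence is typically far stricter than a manually chosen threshold):

```python
import pandas as pd

toy_prices = pd.Series([9, 22, 55, 100, 185, 450, 2000, 90000])

# Tukey fence: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
q1, q3 = toy_prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = toy_prices.between(lower, upper)
print(kept.sum(), 'of', len(toy_prices), 'values kept')  # 7 of 8 values kept
```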
print(f"Total : {prices.query('price >= 20 & price <= 2000').shape[0]}")
print(f"Percentage left: {round(prices.query('price >= 20 & price <= 2000').shape[0] / df.shape[0], 3)}")
Total : 17783
Percentage left: 0.983
prices.query('price >= 20 & price <= 2000').describe()[1:].round(2)
| price | |
|---|---|
| mean | 142.83 |
| std | 146.41 |
| min | 20.00 |
| 25% | 55.00 |
| 50% | 100.00 |
| 75% | 185.00 |
| max | 2000.00 |
fig = px.histogram(prices.query('price >= 20 & price <= 2000'), x='price',
color_discrete_sequence=[px.colors.qualitative.Plotly[2]])
fig.update_layout(
width=500, height=500, template='plotly_dark', bargap=0.05,
    title='Histogram of nightly prices after filtering',
    xaxis_title="Price per night in $",
    yaxis_title="Count")
print(df.shape)
df = df.query('price >= 20 & price <= 2000')
df.shape
(18086, 85)
(17783, 85)
# df.drop(cols, axis=1).columns
categ = ['source', 'host_name', 'host_location',
'host_is_superhost', 'host_neighbourhood', 'host_verifications',
'host_has_profile_pic', 'host_identity_verified', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'property_type', 'room_type',
'has_availability', 'license', 'instant_bookable',
'refrigerator', 'tv', 'iron', 'crib', 'parking', 'elevator', 'AC/heating', 'wifi', 'kitchen', 'washer']
len(categ)
25
step = 0
for i in categ:
d = df[i].describe()[:2]
d['% NaN'] = round(df[i].isna().sum() / length * 100, 2)
u = df[i].value_counts()[:7]
x = pd.concat([d, u]).reset_index().rename(columns={'index':'values'})
# x.at[0, i] = x[0] / length
display(x)
print('')
| values | source | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | city scrape | 14622 |
| 4 | previous scrape | 3161 |
| values | host_name | |
|---|---|---|
| 0 | count | 17781 |
| 1 | unique | 3308 |
| 2 | % NaN | 0.01 |
| 3 | Ukio | 294 |
| 4 | Enter | 234 |
| 5 | SweetInn | 221 |
| 6 | AB Apartment | 184 |
| 7 | Acomodis Apartments | 158 |
| 8 | Selling | 149 |
| 9 | Stay U-Nique | 149 |
| values | host_location | |
|---|---|---|
| 0 | count | 13801 |
| 1 | unique | 389 |
| 2 | % NaN | 22.02 |
| 3 | Barcelona, Spain | 12403 |
| 4 | Spain | 168 |
| 5 | Madrid, Spain | 91 |
| 6 | Catalonia, Spain | 77 |
| 7 | London, United Kingdom | 53 |
| 8 | Paris, France | 39 |
| 9 | Sant Just Desvern, Spain | 30 |
| values | host_is_superhost | |
|---|---|---|
| 0 | count | 17365.000000 |
| 1 | mean | 0.176044 |
| 2 | % NaN | 2.310000 |
| 3 | 0.0 | 14308.000000 |
| 4 | 1.0 | 3057.000000 |
| values | host_neighbourhood | |
|---|---|---|
| 0 | count | 9381 |
| 1 | unique | 151 |
| 2 | % NaN | 46.46 |
| 3 | Dreta de l'Eixample | 1754 |
| 4 | Vila de Gràcia | 700 |
| 5 | El Raval | 639 |
| 6 | La Sagrada Família | 598 |
| 7 | Sant Antoni | 536 |
| 8 | El Poble-sec | 533 |
| 9 | el Fort Pienc | 414 |
| values | host_verifications | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 8 |
| 2 | % NaN | 0.0 |
| 3 | ['email', 'phone'] | 13598 |
| 4 | ['email', 'phone', 'work_email'] | 2352 |
| 5 | ['phone'] | 1227 |
| 6 | ['phone', 'work_email'] | 564 |
| 7 | ['email'] | 33 |
| 8 | [] | 6 |
| 9 | None | 2 |
| values | host_has_profile_pic | |
|---|---|---|
| 0 | count | 17781.000000 |
| 1 | mean | 0.981778 |
| 2 | % NaN | 0.010000 |
| values | host_identity_verified | |
|---|---|---|
| 0 | count | 17781.000000 |
| 1 | mean | 0.921827 |
| 2 | % NaN | 0.010000 |
| values | neighbourhood_cleansed | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 70 |
| 2 | % NaN | 0.0 |
| 3 | la Dreta de l'Eixample | 2206 |
| 4 | el Raval | 1443 |
| 5 | el Barri Gòtic | 1221 |
| 6 | Sant Pere, Santa Caterina i la Ribera | 1120 |
| 7 | la Vila de Gràcia | 1043 |
| 8 | l'Antiga Esquerra de l'Eixample | 1019 |
| 9 | la Sagrada Família | 1004 |
| values | neighbourhood_group_cleansed | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 10 |
| 2 | % NaN | 0.0 |
| 3 | Eixample | 6393 |
| 4 | Ciutat Vella | 4188 |
| 5 | Sants-Montjuïc | 1840 |
| 6 | Sant Martí | 1619 |
| 7 | Gràcia | 1556 |
| 8 | Sarrià-Sant Gervasi | 880 |
| 9 | Horta-Guinardó | 519 |
| values | property_type | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 61 |
| 2 | % NaN | 0.0 |
| 3 | Entire rental unit | 9180 |
| 4 | Private room in rental unit | 5361 |
| 5 | Entire serviced apartment | 495 |
| 6 | Room in hotel | 395 |
| 7 | Entire condo | 372 |
| 8 | Entire loft | 293 |
| 9 | Private room in condo | 219 |
| values | room_type | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 4 |
| 2 | % NaN | 0.0 |
| 3 | Entire home/apt | 10608 |
| 4 | Private room | 6904 |
| 5 | Shared room | 142 |
| 6 | Hotel room | 129 |
| values | has_availability | |
|---|---|---|
| 0 | count | 17783.000000 |
| 1 | mean | 0.937581 |
| 2 | % NaN | 0.000000 |
| 3 | 1 | 16673.000000 |
| 4 | 0 | 1110.000000 |
| values | license | |
|---|---|---|
| 0 | count | 12186 |
| 1 | unique | 5624 |
| 2 | % NaN | 30.95 |
| 3 | Exempt | 4682 |
| 4 | HUTB-000000 | 112 |
| 5 | HUTB-123456 | 25 |
| 6 | AJ000593 | 22 |
| 7 | HB004232 | 19 |
| 8 | HUTB-022013 | 18 |
| 9 | AJ000517 | 15 |
| values | instant_bookable | |
|---|---|---|
| 0 | count | 17783.000000 |
| 1 | mean | 0.381882 |
| 2 | % NaN | 0.000000 |
| 3 | 0 | 10992.000000 |
| 4 | 1 | 6791.000000 |
| values | refrigerator | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | True | 11851 |
| 4 | False | 5932 |
| values | tv | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | True | 14062 |
| 4 | False | 3721 |
| values | iron | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | True | 12588 |
| 4 | False | 5195 |
| values | crib | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | False | 12553 |
| 4 | True | 5230 |
| values | parking | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | False | 11053 |
| 4 | True | 6730 |
| values | elevator | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | True | 9631 |
| 4 | False | 8152 |
| values | AC/heating | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | True | 16867 |
| 4 | False | 916 |
| values | wifi | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | True | 17400 |
| 4 | False | 383 |
| values | kitchen | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | True | 16384 |
| 4 | False | 1399 |
| values | washer | |
|---|---|---|
| 0 | count | 17783 |
| 1 | unique | 2 |
| 2 | % NaN | 0.0 |
| 3 | True | 14839 |
| 4 | False | 2944 |
- host_name indicates that quite a few listings are posted by institutions rather than individual people
- HUTB (or HUT, Habitatge d'Ús Turístic; the B stands for Barcelona) denotes the licence that permits exactly this kind of short-term rental and carries a special tax. The remaining characters encode the property id.
- Exempt means the HUT licence was not required (for simplicity I will still refer to it as a licence)
- splitting out amenities turned out to be worthwhile; many of these categories have interesting proportions that may prove relevant for correlations
- after correcting the licences, the distribution of the HUTB licences breaks down as follows:
The NaN values should not be hard to fill, provided the licences really do reflect the number of nights (HUTB permits stays of fewer than 31 days)
def check_license(lcode):
if isinstance(lcode, str):
code = ""
for c in lcode:
if c.isalpha():
code += c
else:
break
return code
else:
return lcode
check_license('HUTB-123456')
'HUTB'
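The same leading-letters prefix can be extracted vectorised with a regex instead of the per-character loop (a sketch; `str.extract` returns NaN where the string does not start with a letter, and NaN inputs pass through as NaN):

```python
import pandas as pd

toy_licenses = pd.Series(['HUTB-123456', 'hutb77', 'Exempt', '0363 ABC', None])

# Capture the run of letters at the start of each string
prefix = toy_licenses.str.extract(r'^([A-Za-z]+)', expand=False)
print(prefix.tolist())
```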
for i in range(10):
print(
df.query('license.notnull() and license != "Exempt"')['license'].sample().apply(check_license)
)
9138    HUTB
Name: license, dtype: object
517    HUTB
Name: license, dtype: object
14941    HUTB
Name: license, dtype: object
4479    HUTB
Name: license, dtype: object
11786    HB
Name: license, dtype: object
9616    HUTB
Name: license, dtype: object
3809    HUTB
Name: license, dtype: object
10299    HUTB
Name: license, dtype: object
1349    HUTB
Name: license, dtype: object
10735    HUTB
Name: license, dtype: object
df['license_cleansed'] = df['license'].apply(check_license)
df['license_cleansed']
0 HUTB
1 HUTB
2 HUTB
3 HUTB
4 HUTB
...
18081 NaN
18082 HUTB
18083 NaN
18084 HUTB
18085 HUTB
Name: license_cleansed, Length: 17783, dtype: object
df['license_cleansed'].value_counts()
HUTB 6397
Exempt 4682
HB 725
AJ 167
hutb 37
ATB 31
Hutb 26
HUTS 22
15
hb 14
B 13
No 8
HT 7
HUTTE 6
HUTG 4
HV 3
ASB 3
Pendiente 2
T 2
Media 1
HUTb 1
LLB 1
HUTL 1
Monthly 1
HTUB 1
HUBT 1
Numero 1
MINIMUM 1
EP 1
asd 1
El 1
Temporada 1
MID 1
Eri 1
Tourist 1
Periodo 1
Ref 1
Carrera 1
Período 1
LY 1
NIF 1
Name: license_cleansed, dtype: int64
As you can see, there are many different licence prefixes; it is best to consolidate them
def fix_license(lcode):
if isinstance(lcode, str):
if len(lcode) != 0 and lcode.upper()[0] == 'H' :
return 'HUTB'
else:
return lcode
else:
return lcode
fix_license('hutb')
'HUTB'
df['license_cleansed'] = df['license_cleansed'].apply(lambda x: fix_license(x))
df['license_cleansed'].value_counts()
HUTB 7245
Exempt 4682
AJ 167
ATB 31
15
B 13
No 8
ASB 3
Pendiente 2
T 2
Numero 1
Eri 1
LLB 1
MINIMUM 1
EP 1
asd 1
Temporada 1
MID 1
Periodo 1
Tourist 1
Monthly 1
Carrera 1
NIF 1
El 1
Media 1
Período 1
Ref 1
LY 1
Name: license_cleansed, dtype: int64
print(f"HUTB %: {round(df['license_cleansed'].value_counts()['HUTB'] / df.shape[0] * 100, 2)}")
print(f"Exempt %: {round(df['license_cleansed'].value_counts()['Exempt'] / df.shape[0]* 100, 2)}")
print(f"NaN %: {round(df['license_cleansed'].isna().sum() / df.shape[0]* 100, 2)}")
print('The remainder make up 1.5%')
print('The NaN values should not be hard to fill, provided the licences really reflect the number of nights (HUTB permits rentals under 31 days)')
HUTB %: 40.74
Exempt %: 26.33
NaN %: 31.47
The remainder make up 1.5%
The NaN values should not be hard to fill, provided the licences really reflect the number of nights (HUTB permits rentals under 31 days)
grouped_nights = df.groupby('license_cleansed').describe()[nights.split('\n')].sort_values(('minimum_nights','count'), ascending=False)
grouped_nights
| minimum_nights | maximum_nights | ... | minimum_nights_avg_ntm | maximum_nights_avg_ntm | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
| license_cleansed | |||||||||||||||||||||
| HUTB | 7245.0 | 3.190200 | 13.376993 | 1.0 | 1.0 | 2.0 | 3.0 | 865.0 | 7245.0 | 622.537336 | ... | 3.70 | 865.0 | 7245.0 | 6.972791e+02 | 4.658369e+02 | 2.0 | 355.60 | 999.0 | 1125.0 | 9.999000e+03 |
| Exempt | 4682.0 | 16.792610 | 34.970667 | 1.0 | 1.0 | 4.0 | 31.0 | 1000.0 | 4682.0 | 504.924178 | ... | 31.00 | 1000.0 | 4682.0 | 6.429399e+02 | 4.490364e+02 | 1.0 | 340.00 | 365.0 | 1125.0 | 1.125000e+03 |
| AJ | 167.0 | 1.059880 | 0.237978 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 167.0 | 767.077844 | ... | 1.20 | 4.9 | 167.0 | 6.441222e+02 | 4.328183e+02 | 6.5 | 215.80 | 955.3 | 999.0 | 1.125000e+03 |
| ATB | 31.0 | 1.032258 | 0.179605 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 31.0 | 642.258065 | ... | 1.50 | 2.7 | 31.0 | 1.385480e+08 | 5.362931e+08 | 12.0 | 365.00 | 999.0 | 1125.0 | 2.147484e+09 |
| 15.0 | 21.600000 | 23.900986 | 1.0 | 2.5 | 30.0 | 31.5 | 90.0 | 15.0 | 536.333333 | ... | 31.50 | 90.0 | 15.0 | 6.700400e+02 | 4.497104e+02 | 60.0 | 315.00 | 400.0 | 1125.0 | 1.125000e+03 | |
| B | 13.0 | 3.307692 | 8.320503 | 1.0 | 1.0 | 1.0 | 1.0 | 31.0 | 13.0 | 1066.461538 | ... | 2.00 | 31.0 | 13.0 | 1.125000e+03 | 0.000000e+00 | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| No | 8.0 | 31.375000 | 0.517549 | 31.0 | 31.0 | 31.0 | 32.0 | 32.0 | 8.0 | 888.750000 | ... | 31.25 | 32.0 | 8.0 | 8.887500e+02 | 4.374500e+02 | 180.0 | 888.75 | 1125.0 | 1125.0 | 1.125000e+03 |
| ASB | 3.0 | 2.000000 | 1.000000 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.0 | 247.000000 | ... | 2.50 | 3.0 | 3.0 | 2.470000e+02 | 2.043820e+02 | 11.0 | 188.00 | 365.0 | 365.0 | 3.650000e+02 |
| T | 2.0 | 2.000000 | 0.000000 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1125.000000 | ... | 2.00 | 2.0 | 2.0 | 1.125000e+03 | 0.000000e+00 | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| Pendiente | 2.0 | 31.000000 | 0.000000 | 31.0 | 31.0 | 31.0 | 31.0 | 31.0 | 2.0 | 140.000000 | ... | 31.00 | 31.0 | 2.0 | 1.400000e+02 | 5.656854e+01 | 100.0 | 120.00 | 140.0 | 160.0 | 1.800000e+02 |
| NIF | 1.0 | 31.000000 | NaN | 31.0 | 31.0 | 31.0 | 31.0 | 31.0 | 1.0 | 370.000000 | ... | 31.00 | 31.0 | 1.0 | 3.700000e+02 | NaN | 370.0 | 370.00 | 370.0 | 370.0 | 3.700000e+02 |
| Temporada | 1.0 | 33.000000 | NaN | 33.0 | 33.0 | 33.0 | 33.0 | 33.0 | 1.0 | 1125.000000 | ... | 33.00 | 33.0 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| Tourist | 1.0 | 1.000000 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1125.000000 | ... | 1.00 | 1.0 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| Período | 1.0 | 32.000000 | NaN | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 1.0 | 330.000000 | ... | 31.90 | 31.9 | 1.0 | 3.300000e+02 | NaN | 330.0 | 330.00 | 330.0 | 330.0 | 3.300000e+02 |
| Periodo | 1.0 | 32.000000 | NaN | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 1.0 | 330.000000 | ... | 32.00 | 32.0 | 1.0 | 3.300000e+02 | NaN | 330.0 | 330.00 | 330.0 | 330.0 | 3.300000e+02 |
| Numero | 1.0 | 3.000000 | NaN | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | 1125.000000 | ... | 3.00 | 3.0 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| Ref | 1.0 | 1.000000 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1125.000000 | ... | 1.00 | 1.0 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| MINIMUM | 1.0 | 31.000000 | NaN | 31.0 | 31.0 | 31.0 | 31.0 | 31.0 | 1.0 | 120.000000 | ... | 31.00 | 31.0 | 1.0 | 1.200000e+02 | NaN | 120.0 | 120.00 | 120.0 | 120.0 | 1.200000e+02 |
| Monthly | 1.0 | 60.000000 | NaN | 60.0 | 60.0 | 60.0 | 60.0 | 60.0 | 1.0 | 1125.000000 | ... | 32.00 | 32.0 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| Media | 1.0 | 32.000000 | NaN | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 1.0 | 330.000000 | ... | 32.00 | 32.0 | 1.0 | 3.300000e+02 | NaN | 330.0 | 330.00 | 330.0 | 330.0 | 3.300000e+02 |
| MID | 1.0 | 1.000000 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1125.000000 | ... | 10.40 | 10.4 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| LY | 1.0 | 90.000000 | NaN | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 1.0 | 360.000000 | ... | 90.00 | 90.0 | 1.0 | 3.600000e+02 | NaN | 360.0 | 360.00 | 360.0 | 360.0 | 3.600000e+02 |
| LLB | 1.0 | 2.000000 | NaN | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 360.000000 | ... | 2.00 | 2.0 | 1.0 | 3.600000e+02 | NaN | 360.0 | 360.00 | 360.0 | 360.0 | 3.600000e+02 |
| Eri | 1.0 | 2.000000 | NaN | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1125.000000 | ... | 2.00 | 2.0 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| El | 1.0 | 32.000000 | NaN | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 1.0 | 1000.000000 | ... | 32.00 | 32.0 | 1.0 | 1.000000e+03 | NaN | 1000.0 | 1000.00 | 1000.0 | 1000.0 | 1.000000e+03 |
| EP | 1.0 | 31.000000 | NaN | 31.0 | 31.0 | 31.0 | 31.0 | 31.0 | 1.0 | 300.000000 | ... | 31.00 | 31.0 | 1.0 | 3.000000e+02 | NaN | 300.0 | 300.00 | 300.0 | 300.0 | 3.000000e+02 |
| Carrera | 1.0 | 1.000000 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1125.000000 | ... | 1.00 | 1.0 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
| asd | 1.0 | 30.000000 | NaN | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 1.0 | 1125.000000 | ... | 30.00 | 30.0 | 1.0 | 1.125000e+03 | NaN | 1125.0 | 1125.00 | 1125.0 | 1125.0 | 1.125000e+03 |
28 rows × 64 columns
# grouped_nights.to_excel('grouped nights.xlsx')
def fix_other_licenses(min_nights, license):
if license == 'HUTB':
return 'HUTB'
elif license == 'Exempt':
return 'Exempt'
else:
if min_nights >= 28:
return 'Exempt'
else:
return 'HUTB'
fix_other_licenses(32, 'HUTB')
'HUTB'
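The same rule can also be written vectorised with `numpy.where` instead of a row-wise `apply` (a sketch on a toy frame; the 28-night threshold mirrors the function above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'license_cleansed': ['HUTB', 'Exempt', 'No', 'Pendiente'],
    'minimum_nights':   [2, 31, 32, 5],
})

known = toy['license_cleansed'].isin(['HUTB', 'Exempt'])
# For unknown licences, 28+ minimum nights suggests a long-term (Exempt) rental
toy['license_cleansed'] = np.where(
    known, toy['license_cleansed'],
    np.where(toy['minimum_nights'] >= 28, 'Exempt', 'HUTB'))
print(toy['license_cleansed'].tolist())  # ['HUTB', 'Exempt', 'Exempt', 'HUTB']
```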
df['license_cleansed'] = df.apply(lambda row: fix_other_licenses(row['minimum_nights'], row['license_cleansed']), axis=1)
grouped_nights = df.groupby('license_cleansed').describe()[nights.split('\n')].sort_values(('minimum_nights','count'), ascending=False)
grouped_nights
| minimum_nights | maximum_nights | ... | minimum_nights_avg_ntm | maximum_nights_avg_ntm | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
| license_cleansed | |||||||||||||||||||||
| Exempt | 8903.0 | 26.043132 | 37.487728 | 1.0 | 3.0 | 31.0 | 32.0 | 1125.0 | 8903.0 | 515.466472 | ... | 32.0 | 1125.0 | 8903.0 | 626.931461 | 4.357041e+02 | 1.0 | 330.0 | 365.0 | 1125.0 | 3.000000e+03 |
| HUTB | 8880.0 | 3.015541 | 12.133153 | 1.0 | 1.0 | 2.0 | 3.0 | 865.0 | 8880.0 | 630.362275 | ... | 3.5 | 865.0 | 8880.0 | 484362.419865 | 3.222655e+07 | 1.0 | 330.0 | 999.0 | 1125.0 | 2.147484e+09 |
2 rows × 64 columns
The licences have been roughly cleaned up. Apart from errors already present in the original HUTB and Exempt classes, I think this can stay as is. All that remains is to drop license and create dummies for license_cleansed.
dum_license = pd.get_dummies(df["license_cleansed"], drop_first=False)
df = pd.concat([df, dum_license], axis=1)
# df.drop('license_cleansed', axis=1, inplace=True)
df.drop('license', axis=1, inplace=True)
We one-hot encode the room type; it should matter a lot for the price, in particular shared_room / private_room
dum_room_type = pd.get_dummies(df["room_type"], drop_first=False)
df = pd.concat([df, dum_room_type], axis=1)
# df.drop('ogrzewanie', axis=1, inplace=True)
df.head()
| id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | wifi | kitchen | washer | license_cleansed | Exempt | HUTB | Entire home/apt | Hotel room | Private room | Shared room | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18674 | https://www.airbnb.com/rooms/18674 | 20230906022853 | 2023-09-06 | city scrape | Rental unit in Barcelona · ★4.30 · 3 bedrooms ... | 110m2 apartment to rent in Barcelona. Located ... | Apartment in Barcelona located in the heart of... | https://a0.muscache.com/pictures/13031453/413c... | 71615 | ... | True | True | True | HUTB | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 23197 | https://www.airbnb.com/rooms/23197 | 20230906022853 | 2023-09-06 | city scrape | Rental unit in Sant Adria de Besos · ★4.77 · 3... | Beautiful spacious apartment, large terrace, 5... | Strategically located in the Parc del Fòrum, a... | https://a0.muscache.com/pictures/miso/Hosting-... | 90417 | ... | True | True | True | HUTB | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 117010 | https://www.airbnb.com/rooms/117010 | 20230906022853 | 2023-09-06 | previous scrape | Rental unit in Barcelona · ★4.55 · 3 bedrooms ... | Have an authentic Barcelona experience! Modern... | Apartment is located just 240 meters from Sagr... | https://a0.muscache.com/pictures/4052d8b5-7352... | 567180 | ... | True | True | True | HUTB | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 32711 | https://www.airbnb.com/rooms/32711 | 20230906022853 | 2023-09-06 | city scrape | Rental unit in Barcelona · ★4.46 · 2 bedrooms ... | A lovely two bedroom apartment only 250 m from... | What's nearby <br />This apartment is located... | https://a0.muscache.com/pictures/357b25e4-f414... | 135703 | ... | True | True | True | HUTB | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 118228 | https://www.airbnb.com/rooms/118228 | 20230906022853 | 2023-09-06 | previous scrape | Rental unit in Barcelona · ★4.56 · 3 bedrooms ... | Modern 100m2 recently decorated apartment with... | Apartment is located just 240 meters from Sagr... | https://a0.muscache.com/pictures/c4b34854-5db1... | 567180 | ... | True | True | True | HUTB | 0 | 1 | 1 | 0 | 0 | 0 |
5 rows × 91 columns
last_scraped
- last_scraped and calendar_last_scraped are identical and contain no NaN; they can generally be dropped

date_c = """last_scraped
host_since
calendar_last_scraped
first_review
last_review"""
date_c = date_c.split('\n')
date_c
['last_scraped', 'host_since', 'calendar_last_scraped', 'first_review', 'last_review']
df[date_c].describe()
| last_scraped | host_since | calendar_last_scraped | first_review | last_review | |
|---|---|---|---|---|---|
| count | 17783 | 17781 | 17783 | 13412 | 13412 |
| unique | 1 | 3257 | 1 | 3342 | 1604 |
| top | 2023-09-06 | 2020-05-15 | 2023-09-06 | 2023-07-09 | 2023-08-20 |
| freq | 17783 | 294 | 17783 | 36 | 367 |
df[date_c].isna().sum()
last_scraped                0
host_since                  2
calendar_last_scraped       0
first_review             4371
last_review              4371
dtype: int64
fig = px.histogram(df[['first_review','last_review']],
)
fig.update_layout(
width=900, height=500, template='plotly_dark', bargap=0.05,
    title='Review dates histogram',
    xaxis_title="",
    yaxis_title="Frequency",
legend=dict(title=''))
# df['host_since'].max()
old = df[df['host_since'].notnull()]['host_since'].min()
new = df[df['host_since'].notnull()]['host_since'].max()
(new.year - old.year) * 12 + (new.month - old.month)  # months between the first and the last host registration
180
fig = px.histogram(df['host_since'], color_discrete_sequence=[px.colors.qualitative.Plotly[9]], nbins = 180)
fig.update_layout(
width=900, height=500, template='plotly_dark', bargap=0.05,
    title='Histogram of host start dates',
    xaxis_title="",
    yaxis_title="Number of hosts",
legend=dict(title=''))
I checked several locations to verify Plaça de Catalunya as the centre of Barcelona, to which a distance can be computed from every listing.
Flat (planar) method:
def get_distance(lat, lon):
    # plaza = [41.387016, 2.170047]
    sagrada = [41.4036, 2.1744]  # [lat, lon]; 111139 m is roughly one degree of latitude
    return (((lat - sagrada[0])**2 + (lon - sagrada[1])**2)**0.5) * 111139
print(get_distance(41.40556, 2.17262), '(m)')
294.2561129702953 (m)
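A note on the flat method: one degree of longitude shrinks with latitude (by cos of the latitude, roughly 0.75 at Barcelona), so a slightly better planar approximation scales the longitude difference accordingly. A sketch:

```python
import math

def get_equirect_distance(lat, lon, ref_lat=41.4036, ref_lon=2.1744):
    # 111139 m is roughly one degree of latitude; a degree of longitude
    # is shorter by a factor of cos(latitude)
    dlat = lat - ref_lat
    dlon = (lon - ref_lon) * math.cos(math.radians(ref_lat))
    return math.hypot(dlat, dlon) * 111139

print(round(get_equirect_distance(41.40556, 2.17262)), '(m)')  # 264 (m)
```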
Haversine method $$ d = 2r\,\arcsin \left( \sqrt{\sin^2 \frac{\phi_2 - \phi_1}{2} + \cos \phi_1 \cdot \cos\phi_2 \cdot \sin^2\frac{\lambda_2 - \lambda_1}{2}} \right)$$
The result comes out similar; the haversine is useful for much larger distances and is computationally more expensive. (There may well be a library that would compute this outside pure Python.)
import math
def get_haversin_distance(lat1, lon1, lat2, lon2):
    # arguments in (lat, lon) order, matching the call below
    R = 6371  # Earth radius in km
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat / 2) * math.sin(dLat / 2) + \
        math.sin(dLon / 2) * math.sin(dLon / 2) * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c
print(round(get_haversin_distance(41.40556, 2.17262, 41.4036, 2.1744), 4), '(km)')
0.2637 (km)
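As for computing this outside a Python-level loop: the same haversine formula vectorises directly over whole columns with NumPy (a sketch):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # All arguments may be scalars or NumPy arrays of degrees
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

lats = np.array([41.40556, 41.3833])
lons = np.array([2.17262, 2.1767])
dists = haversine_km(lats, lons, 41.4036, 2.1744)
print(np.round(dists, 4))
```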
We create a new column center_distance.
Checking which location works best:
def get_distance(lat, lon):
    plaza = [41.387016, 2.170047]  # 0.1421, we stay with Plaça de Catalunya
    sagrada = [41.4036, 2.1744]  # 0.0868
    gotic = [41.3833, 2.1767]  # 0.1405
    mila = [41.3954, 2.1618]  # 0.0960
    vicens = [41.4029, 2.1526]  # 0.01
    MuseuMares = [41.3844, 2.1778]  # 0.1437
    plazadejoan = [41.3873, 2.1881]  # 0.1322
    picasso = [41.3852, 2.1809]  # 0.1435
    place = plaza
    return (((lat - place[0])**2 + (lon - place[1])**2)**0.5) * 111139
print(get_distance(41.40556, 2.17262), '(m)')
df['center_distance'] = get_distance(df['latitude'], df['longitude'])
df['price'].corr(df['center_distance'])
2080.7057154382396 (m)
-0.11007620393122651
# def get_distance(lon, lat, place):
# # plaza = [41.387016, 2.170047]
# # sagrada = [41.4036, 2.1744]
# return (((lon - place[0])**2 + (lat - place[1])**2)**0.5) * 111139
# # print(get_distance(41.40556, 2.17262), '(m)')
# places = {
# 'Basilica de la Sagrada Familia' : [41.4036, 2.1744],
# 'Barri Gòtic' : [41.3833, 2.1767],
# 'Casa Milà' : [41.3954, 2.1618],
# 'Casa Vicens' : [41.4029, 2.1526],
# 'Museu Frederic Marès': [41.3844, 2.1778],
# }
# for p in places:
# print(p, places[p], df['price'].corr(
# get_distance(df['latitude'], df['longitude'], places[p])
# ))
- host_neighbourhood, neighbourhood, neighborhood_overview and host_about can safely be dropped: we have neighbourhood_cleansed, and the host description should not matter much. One could consider whether certain host practices affect the price, e.g. requiring a deposit for furniture damage or being able to speak English.
- NaN in bedrooms and beds: filled based on accommodates
- host_identity_verified, host_listings_count, host_picture_url, host_name, host_since: each has only 2 missing values

n = df.isna().sum().sort_values(ascending=False)
n[n > 0]
fig = px.bar(n[n>2])
fig.update_layout(
width=1300, height=700, template='plotly_dark', bargap=0.05,
    title='Missing values by column',
    xaxis_title="Column",
    yaxis_title="NaN count",
legend=dict(title=''))
fig
n = df.isna().sum().sort_values(ascending=False)
n[n > 0]
host_neighbourhood             8402
neighbourhood                  7821
neighborhood_overview          7821
host_about                     6527
bedrooms                       6098
review_scores_checkin          4453
review_scores_value            4453
review_scores_location         4452
review_scores_accuracy         4449
review_scores_cleanliness      4448
review_scores_communication    4447
first_review                   4371
reviews_per_month              4371
last_review                    4371
review_scores_rating           4371
host_location                  3982
host_response_time             2812
host_response_rate             2812
host_acceptance_rate           2535
host_is_superhost               418
beds                            307
description                     109
host_has_profile_pic              2
host_identity_verified            2
host_thumbnail_url                2
host_total_listings_count         2
host_listings_count               2
host_picture_url                  2
host_since                        2
host_name                         2
dtype: int64
s = df.query('host_name.isna()').iloc[0]
s[s.isna()]
host_name                      NaN
host_since                     NaN
host_location                  NaN
host_about                     NaN
host_response_time             NaN
host_response_rate             NaN
host_acceptance_rate           NaN
host_thumbnail_url             NaN
host_picture_url               NaN
host_neighbourhood             NaN
host_listings_count            NaN
host_total_listings_count      NaN
host_has_profile_pic           NaN
host_identity_verified         NaN
bedrooms                       NaN
review_scores_accuracy         NaN
review_scores_cleanliness      NaN
review_scores_checkin          NaN
review_scores_communication    NaN
review_scores_location         NaN
review_scores_value            NaN
Name: 2038, dtype: object
s = df.query('host_name.isna()').iloc[1]
s[s.isna()]
neighborhood_overview          NaN
host_name                      NaN
host_since                     NaN
host_location                  NaN
host_about                     NaN
host_response_time             NaN
host_response_rate             NaN
host_acceptance_rate           NaN
host_thumbnail_url             NaN
host_picture_url               NaN
host_neighbourhood             NaN
host_listings_count            NaN
host_total_listings_count      NaN
host_has_profile_pic           NaN
host_identity_verified         NaN
neighbourhood                  NaN
bedrooms                       NaN
first_review                   NaN
last_review                    NaN
review_scores_rating           NaN
review_scores_accuracy         NaN
review_scores_cleanliness      NaN
review_scores_checkin          NaN
review_scores_communication    NaN
review_scores_location         NaN
review_scores_value            NaN
reviews_per_month              NaN
Name: 2064, dtype: object
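Rows like the two above, where nearly every host_* field is missing, can be flagged generically by counting missing values per row instead of inspecting them one by one (a sketch on a toy frame):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'host_name':  ['Anna', np.nan, 'Marc'],
    'host_since': ['2015-01-01', np.nan, np.nan],
    'price':      [100, 80, 120],
})

# Missing values per row; rows missing most fields stand out
nan_per_row = toy.isna().sum(axis=1)
print(toy[nan_per_row >= 2].index.tolist())  # [1]
```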
beds and bedrooms by mean values per accommodates
merged_beds_df = pd.merge(
pd.merge(
df.groupby('accommodates').describe()['beds'],
df.groupby('accommodates')['beds'].apply(lambda x: x.isna().sum()),
on='accommodates').rename(columns={'beds':'NaN beds'}),
pd.merge(
df.groupby('accommodates').describe()['bedrooms'],
df.groupby('accommodates')['bedrooms'].apply(lambda x: x.isna().sum()),
on='accommodates').rename(columns={'bedrooms':'NaN bedrooms'}),
on='accommodates')
merged_beds_df
| count_x | mean_x | std_x | min_x | 25%_x | 50%_x | 75%_x | max_x | NaN beds | count_y | mean_y | std_y | min_y | 25%_y | 50%_y | 75%_y | max_y | NaN bedrooms | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| accommodates | ||||||||||||||||||
| 1 | 2493.0 | 1.199759 | 1.047147 | 1.0 | 1.0 | 1.0 | 1.00 | 26.0 | 116 | 484.0 | 1.826446 | 2.032562 | 1.0 | 1.0 | 1.0 | 1.00 | 9.0 | 2125 |
| 2 | 5689.0 | 1.198453 | 0.522178 | 1.0 | 1.0 | 1.0 | 1.00 | 16.0 | 137 | 2547.0 | 1.063997 | 0.276444 | 1.0 | 1.0 | 1.0 | 1.00 | 4.0 | 3279 |
| 3 | 1474.0 | 2.101085 | 0.664255 | 1.0 | 2.0 | 2.0 | 2.00 | 6.0 | 13 | 1137.0 | 1.611258 | 0.590548 | 1.0 | 1.0 | 2.0 | 2.00 | 6.0 | 350 |
| 4 | 3673.0 | 2.560305 | 0.876790 | 1.0 | 2.0 | 2.0 | 3.00 | 8.0 | 27 | 3437.0 | 1.875182 | 0.634994 | 1.0 | 1.0 | 2.0 | 2.00 | 10.0 | 263 |
| 5 | 1345.0 | 3.581413 | 0.966452 | 1.0 | 3.0 | 4.0 | 4.00 | 15.0 | 5 | 1328.0 | 2.578313 | 0.636645 | 1.0 | 2.0 | 3.0 | 3.00 | 5.0 | 22 |
| 6 | 1631.0 | 4.148375 | 1.162316 | 1.0 | 3.0 | 4.0 | 5.00 | 12.0 | 4 | 1603.0 | 2.744230 | 0.656825 | 1.0 | 2.0 | 3.0 | 3.00 | 6.0 | 32 |
| 7 | 319.0 | 5.175549 | 1.249112 | 1.0 | 4.0 | 5.0 | 6.00 | 10.0 | 1 | 315.0 | 3.396825 | 0.631944 | 2.0 | 3.0 | 3.0 | 4.00 | 5.0 | 5 |
| 8 | 471.0 | 5.636943 | 1.390635 | 1.0 | 5.0 | 6.0 | 6.00 | 13.0 | 2 | 464.0 | 3.482759 | 0.757771 | 1.0 | 3.0 | 4.0 | 4.00 | 7.0 | 9 |
| 9 | 78.0 | 6.346154 | 1.885191 | 1.0 | 5.0 | 7.0 | 7.75 | 10.0 | 1 | 73.0 | 4.000000 | 1.067187 | 1.0 | 3.0 | 4.0 | 5.00 | 6.0 | 6 |
| 10 | 165.0 | 7.096970 | 1.715156 | 4.0 | 6.0 | 7.0 | 8.00 | 13.0 | 1 | 162.0 | 4.160494 | 1.027048 | 1.0 | 4.0 | 4.0 | 5.00 | 7.0 | 4 |
| 11 | 12.0 | 7.750000 | 1.356801 | 6.0 | 7.0 | 8.0 | 8.00 | 11.0 | 0 | 12.0 | 4.083333 | 0.996205 | 2.0 | 4.0 | 4.0 | 4.25 | 6.0 | 0 |
| 12 | 41.0 | 9.756098 | 2.817627 | 5.0 | 8.0 | 9.0 | 11.00 | 18.0 | 0 | 40.0 | 4.975000 | 1.510434 | 1.0 | 4.0 | 5.0 | 6.00 | 9.0 | 1 |
| 13 | 9.0 | 8.888889 | 1.763834 | 5.0 | 8.0 | 9.0 | 10.00 | 11.0 | 0 | 9.0 | 5.333333 | 1.224745 | 4.0 | 5.0 | 5.0 | 6.00 | 8.0 | 0 |
| 14 | 20.0 | 10.600000 | 3.377947 | 5.0 | 8.5 | 10.0 | 14.00 | 18.0 | 0 | 19.0 | 5.421053 | 1.426565 | 1.0 | 5.0 | 6.0 | 6.00 | 7.0 | 1 |
| 15 | 23.0 | 12.869565 | 2.735363 | 8.0 | 11.0 | 13.0 | 15.00 | 18.0 | 0 | 23.0 | 6.347826 | 1.612697 | 4.0 | 6.0 | 6.0 | 6.00 | 10.0 | 0 |
| 16 | 33.0 | 15.030303 | 5.370550 | 1.0 | 12.0 | 15.0 | 18.00 | 30.0 | 0 | 32.0 | 8.093750 | 1.940392 | 4.0 | 6.0 | 8.0 | 9.00 | 12.0 | 1 |
merged_beds_df = merged_beds_df[['mean_x', 'mean_y']]
merged_beds_df['mean_x'] = merged_beds_df['mean_x'].round(decimals=0)
merged_beds_df['mean_y'] = merged_beds_df['mean_y'].round(decimals=0)
merged_beds_df = merged_beds_df.rename(columns={'mean_x':'avg_beds', 'mean_y':'avg_bedrooms'})
merged_beds_df
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| avg_beds | avg_bedrooms | |
|---|---|---|
| accommodates | ||
| 1 | 1.0 | 2.0 |
| 2 | 1.0 | 1.0 |
| 3 | 2.0 | 2.0 |
| 4 | 3.0 | 2.0 |
| 5 | 4.0 | 3.0 |
| 6 | 4.0 | 3.0 |
| 7 | 5.0 | 3.0 |
| 8 | 6.0 | 3.0 |
| 9 | 6.0 | 4.0 |
| 10 | 7.0 | 4.0 |
| 11 | 8.0 | 4.0 |
| 12 | 10.0 | 5.0 |
| 13 | 9.0 | 5.0 |
| 14 | 11.0 | 5.0 |
| 15 | 13.0 | 6.0 |
| 16 | 15.0 | 8.0 |
avg_beds_bedrooms = merged_beds_df.to_dict()
avg_beds_bedrooms
{'avg_beds': {1: 1.0,
2: 1.0,
3: 2.0,
4: 3.0,
5: 4.0,
6: 4.0,
7: 5.0,
8: 6.0,
9: 6.0,
10: 7.0,
11: 8.0,
12: 10.0,
13: 9.0,
14: 11.0,
15: 13.0,
16: 15.0},
'avg_bedrooms': {1: 2.0,
2: 1.0,
3: 2.0,
4: 2.0,
5: 3.0,
6: 3.0,
7: 3.0,
8: 3.0,
9: 4.0,
10: 4.0,
11: 4.0,
12: 5.0,
13: 5.0,
14: 5.0,
15: 6.0,
16: 8.0}}
print(df['beds'].isna().sum())
print(df['bedrooms'].isna().sum())
307
6098
def beds_fix(current_beds, accommodates):
    if pd.isna(current_beds):
        return avg_beds_bedrooms['avg_beds'][accommodates]
    else:
        return current_beds
def bedrooms_fix(current_bedrooms, accommodates):
    if pd.isna(current_bedrooms):
        return avg_beds_bedrooms['avg_bedrooms'][accommodates]
    else:
        return current_bedrooms
beds_fix(None, 4)
3.0
print(df['beds'].isna().sum(), df['bedrooms'].isna().sum())
df['beds'] = df.apply(lambda row: beds_fix(row['beds'], row['accommodates']), axis=1)
df['bedrooms'] = df.apply(lambda row: bedrooms_fix(row['bedrooms'], row['accommodates']), axis=1)
print(df['beds'].isna().sum(), df['bedrooms'].isna().sum())
307 6098
0 0
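The two helper functions above can also be expressed more compactly with a grouped transform; a minimal sketch on toy data, assuming the same accommodates-based grouping (the toy values here are illustrative, not from the dataset):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'accommodates': [2, 2, 2, 4, 4, 4],
    'beds':         [1.0, np.nan, 1.0, 2.0, 4.0, np.nan],
})

# Fill each missing value with the rounded mean of its accommodates group.
group_mean = toy.groupby('accommodates')['beds'].transform('mean').round()
toy['beds'] = toy['beds'].fillna(group_mean)
print(toy['beds'].tolist())  # [1.0, 1.0, 1.0, 2.0, 4.0, 3.0]
```

This avoids building the intermediate dict and the row-wise `apply`, at the cost of recomputing the means on the same frame being filled.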
n = df.isna().sum().sort_values(ascending=False)
n[n > 0]
fig = px.bar(n[n>2])
fig.update_layout(
width=1300, height=700, template='plotly_dark', bargap=0.05,
title='Wykres brakujących wartości',
xaxis_title="Kolumna",
yaxis_title="Ilość NaN",
legend=dict(title=''))
fig
df.query('number_of_reviews==0').shape
(4371, 92)
So beds and bedrooms are taken care of. Only the reviews remain. The numbers of missing review-related features are very close to one another (~4450 and 4371). These values are unavailable simply because no reviews were ever left.
Since price correlates most strongly with the number of reviews rather than with the actual review scores, I think these columns can safely be dropped, keeping only number_of_reviews and number_of_reviews_ltm.
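Under that decision, the score columns could be dropped by prefix; a sketch on a toy frame whose column names mirror the dataset (the selection itself is an assumption about which columns to keep):

```python
import pandas as pd

toy = pd.DataFrame(columns=[
    'price', 'number_of_reviews', 'number_of_reviews_ltm',
    'review_scores_rating', 'review_scores_value', 'review_scores_checkin',
])

# Drop every review_scores_* column, keep the review counts.
score_cols = [c for c in toy.columns if c.startswith('review_scores_')]
toy = toy.drop(columns=score_cols)
print(list(toy.columns))  # ['price', 'number_of_reviews', 'number_of_reviews_ltm']
```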
from pandas.api.types import is_numeric_dtype
numeric = []
for i in df.columns:
if is_numeric_dtype(df[i]):
numeric.append(i)
numeric.remove('id')
numeric.remove('scrape_id')
numeric.remove('host_id')
numeric_df = df[numeric]
# numeric.remove('price')
ndf = df.dropna()
ndf.shape
(3339, 92)
numeric_df.shape
(17783, 63)
corr_df = numeric_df.corr(method='spearman', min_periods=33)
fig = px.imshow(corr_df.round(decimals=2) * 100, # on a 1-100 scale so the numbers take less space in each square
color_continuous_scale='RdBu_r',
text_auto=True,
labels=dict(color="Correlation"),
x=numeric_df.columns,
y=numeric_df.columns)
fig.update_traces(textfont_size=8)
fig.update_xaxes(tickfont=dict(size=9))
fig.update_yaxes(tickfont=dict(size=9))
fig.update_layout(height=1200, )
fig.show()
corr_df.columns
Index(['host_response_time', 'host_response_rate', 'host_acceptance_rate',
'host_is_superhost', 'host_listings_count', 'host_total_listings_count',
'host_has_profile_pic', 'host_identity_verified', 'latitude',
'longitude', 'accommodates', 'bedrooms', 'beds', 'price',
'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
'maximum_minimum_nights', 'minimum_maximum_nights',
'maximum_maximum_nights', 'minimum_nights_avg_ntm',
'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
'availability_60', 'availability_90', 'availability_365',
'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'instant_bookable',
'calculated_host_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'calculated_host_listings_count_shared_rooms', 'reviews_per_month',
'private_bathrooms', 'shared_bathrooms', 'total_bathrooms',
'refrigerator', 'tv', 'iron', 'crib', 'parking', 'elevator',
'AC/heating', 'wifi', 'kitchen', 'washer', 'Exempt', 'HUTB',
'Entire home/apt', 'Hotel room', 'Private room', 'Shared room',
'center_distance'],
dtype='object')
# fig = px.bar(corr_df.round(decimals=2)[1:])
fig = px.bar(numeric_df.corr(method='spearman', min_periods=33)['price'].sort_values(ascending=False)[1:],
color_discrete_sequence=[px.colors.qualitative.Plotly[3]])
fig.update_yaxes(range=[-0.75,0.75])
fig.update_xaxes(tickangle=50)
fig.update_layout(
showlegend=False,
title=dict(text="Korelacja cech wobec cen ofert", font=dict(size=22), automargin=True, yref='paper'),
xaxis_title="Cecha",
yaxis_title="Korelacja",
template='plotly_dark',
height=660,
width=1920
)
- Exempt will be dropped before modeling.
- Private room and Entire home/apt have high correlations of about ±0.57; the remaining types are lower, e.g. Shared room at just -0.07.
- private_bathrooms has the highest correlation of the bathroom features at 0.5, total_bathrooms 0.19, and shared_bathrooms -0.48.
- shared_bathrooms listings are predominantly less luxurious: lower price, fewer guests accommodated, no amenities; hostel-like offers.
- Distance to the center (center_distance) has a correlation of about 0.14.
- The amenities show a wide spread of correlations.
- crib correlates very highly with accommodates: small children are counted as regular guests.
- wifi seems to be a standard: no correlation.
- kitchen is the only amenity with a negative correlation (admittedly practically negligible): probably less 'hotel-like' offers.
- Among the night-count columns (minimum_nights, maximum_minimum_nights etc.), there are high negative correlations with price, as well as, of course, HUTB.
- maximum_nights and minimum_nights: in theory, if someone used the model to determine a rental price for a given listing, they would more often have these values than maximum_maximum_nights etc. For example, someone wanting to price their apartment for rent. The latter features would be useful for someone who has already been renting and wants to update the rental price.
- host_response_time also looks good, with a correlation of -0.29; as a reminder, this is what the ordinal encoding looked like:
response_ordinal = {
'within an hour' : 1,
'within a few hours' : 2,
'within a day' : 3,
'a few days or more' : 4
}
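Applied to the column, the mapping above would typically use `Series.map` (toy values here; any category absent from the dict would become NaN):

```python
import pandas as pd

response_ordinal = {
    'within an hour': 1,
    'within a few hours': 2,
    'within a day': 3,
    'a few days or more': 4,
}

s = pd.Series(['within an hour', 'within a day', 'a few days or more'])
encoded = s.map(response_ordinal)
print(encoded.tolist())  # [1, 3, 4]
```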
- calculated_host_listings_count_private_rooms: the second-highest negative correlation; it follows the same logic as private vs shared_bathrooms: rooms on their own are cheaper than entire apartments, i.e. calculated_host_listings_count_entire_homes.
- review_scores_rating is entirely irrelevant: a correlation of -0.03, and similarly review_scores_value (which concerns the rating relative to the cost of the stay).
- number_of_reviews_ltm and number_of_reviews also correlate highly with each other (0.83).
print(df.shape)
dum_neighs = pd.get_dummies(df["neighbourhood_group_cleansed"], drop_first=False)
df = pd.concat([df, dum_neighs], axis=1)
df.shape
(17783, 92)
(17783, 102)
districts = df.columns[-10:].tolist()
districts.append('price')
fig = px.bar(df[districts].corr(method='spearman', min_periods=33)['price'].sort_values(ascending=False)[1:],
color_discrete_sequence=[px.colors.qualitative.Plotly[3]])
fig.update_yaxes(range=[-0.5,0.5])
fig.update_layout(
showlegend=False,
title=dict(text="Korelacja dzielnic do cen", font=dict(size=22), automargin=True, yref='paper'),
xaxis_title="Dzielnice",
yaxis_title="Korelacja",
template='plotly_dark',
width=800,
height=400
)
As can be seen, the districts do not correlate strongly, the highest being Eixample at 0.19. Since the districts are relevant to other features, e.g. center_distance, I think they are worth including as dummies in the model. Alternatively, only the Eixample column could be kept to simplify the model.
from scipy.stats import chi2_contingency
def cramers_V(var1, var2):
    crosstab = np.array(pd.crosstab(var1, var2, rownames=None, colnames=None))  # build the contingency table
    stat = chi2_contingency(crosstab)[0]   # chi-squared test statistic
    obs = np.sum(crosstab)                 # number of observations
    mini = min(crosstab.shape) - 1         # min of (rows, columns) of the table, minus 1
    return np.sqrt(stat / (obs * mini))
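As a sanity check of the function above (restated here so the snippet is self-contained), two perfectly associated binary variables should give V close to 1; it is not exactly 1 here because `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_V(var1, var2):
    crosstab = np.array(pd.crosstab(var1, var2))
    stat = chi2_contingency(crosstab)[0]
    obs = np.sum(crosstab)
    mini = min(crosstab.shape) - 1
    return np.sqrt(stat / (obs * mini))

a = pd.Series([0, 1] * 50)          # a variable compared against itself
print(round(cramers_V(a, a), 2))    # 0.98 (it would be 1.0 with correction=False)
```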
nominal_cols = [
'host_is_superhost',
'HUTB',
'Entire home/apt', 'Private room', 'Hotel room', 'Shared room',
'refrigerator', 'tv', 'iron', 'crib', 'parking', 'elevator', 'AC/heating', 'wifi', 'kitchen', 'washer'
]
nominal_df = numeric_df[nominal_cols]
# nominal_df
rows= []
for var1 in nominal_df:
col = []
for var2 in nominal_df :
cramers =cramers_V(nominal_df[var1], nominal_df[var2])
col.append(round(cramers,2))
rows.append(col)
cramers_results = np.array(rows)
cramer_df = pd.DataFrame(cramers_results, columns = nominal_df.columns, index =nominal_df.columns)
# cramer_df
fig = px.imshow(cramer_df, text_auto=True,)
fig.update_traces(textfont_size=10)
fig.update_layout(height=1100, width=1100,
title='Współczynnik V Cramera dla wybranych cech kategorycznych')
fig.show()
I would not say this chart is particularly useful.
gdf.head(3)
| index | neighbourhood | neighbourhood_group | geometry | |
|---|---|---|---|---|
| 0 | 0 | el Raval | Ciutat Vella | MULTIPOLYGON (((2.17737 41.37532, 2.17852 41.3... |
| 1 | 1 | el Barri Gòtic | Ciutat Vella | MULTIPOLYGON (((2.18287 41.38074, 2.18289 41.3... |
| 2 | 2 | la Dreta de l'Eixample | Eixample | MULTIPOLYGON (((2.17091 41.40182, 2.17332 41.3... |
mean_neighs = df.groupby('neighbourhood_cleansed')['price'].mean()
mean_neighs = mean_neighs.to_dict()
c_mean_neighs = df.groupby('neighbourhood_cleansed')['price'].describe()[['count', 'mean']].to_dict()
def get_mean_neigh(neigh_name):
if neigh_name in mean_neighs:
return mean_neighs[neigh_name]
else:
return None
def c_get_mean_neigh(neigh_name):
if neigh_name in c_mean_neighs['mean']:
return c_mean_neighs['mean'][neigh_name]
else:
return None
def c_get_count_neigh(neigh_name):
if neigh_name in c_mean_neighs['mean']:
return c_mean_neighs['count'][neigh_name]
else:
return 0
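The dict-lookup helpers above can be collapsed into `Series.map`, which returns NaN for unknown keys (fillable with 0 for the counts, mirroring `c_get_count_neigh`); a sketch on toy data (values illustrative):

```python
import pandas as pd

mean_neighs = {'el Raval': 111.1, 'el Barri Gòtic': 133.7}  # toy per-neighbourhood means
counts = {'el Raval': 1443, 'el Barri Gòtic': 1221}         # toy listing counts

neigh = pd.Series(['el Raval', 'Unknown', 'el Barri Gòtic'])
means = neigh.map(mean_neighs)            # NaN where the neighbourhood is missing
n_offers = neigh.map(counts).fillna(0)    # 0 where missing, like c_get_count_neigh
print(means.tolist())
print(n_offers.tolist())
```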
c_get_mean_neigh('el Turó de la Peira')
40.63157894736842
gdf['mean_neigh'] = gdf['neighbourhood'].apply(lambda x: get_mean_neigh(x))
gdf['count_neigh'] = gdf['neighbourhood'].apply(lambda x: c_get_count_neigh(x))
gdf = gdf.rename(columns={'mean_neigh':'Średnia cena za nocleg w dzielnicy [$]', 'count_neigh':'Ilość ofert w dzielnicy'})
gdf.head(3)
| index | neighbourhood | neighbourhood_group | geometry | Średnia cena za nocleg w dzielnicy [$] | Ilość ofert w dzielnicy | |
|---|---|---|---|---|---|---|
| 0 | 0 | el Raval | Ciutat Vella | MULTIPOLYGON (((2.17737 41.37532, 2.17852 41.3... | 111.116424 | 1443.0 |
| 1 | 1 | el Barri Gòtic | Ciutat Vella | MULTIPOLYGON (((2.18287 41.38074, 2.18289 41.3... | 133.708436 | 1221.0 |
| 2 | 2 | la Dreta de l'Eixample | Eixample | MULTIPOLYGON (((2.17091 41.40182, 2.17332 41.3... | 206.169538 | 2206.0 |
fig = px.choropleth_mapbox(gdf, geojson=geojson,
locations=gdf.index,
color='Ilość ofert w dzielnicy',
hover_name='neighbourhood_group',
hover_data={'neighbourhood': False, 'neighbourhood_group': False, 'index': False, 'Ilość ofert w dzielnicy':True},
featureidkey='properties.index',
mapbox_style='carto-positron',
center={'lat': 41.3933, 'lon': 2.145},
zoom=10.85,
opacity=0.75)
fig.update_traces(
name='Średnia cena za nocleg', selector=(dict(type='choroplethmapbox'))
)
fig.update_layout(
height=800, width=800,
# template='plotly_dark',
bargap=0.05,
title='Mapa dzielnic (mniejszych) Barcelony')
fig.show()
fig = px.choropleth_mapbox(gdf, geojson=geojson,
locations=gdf.index,
color='Średnia cena za nocleg w dzielnicy [$]',
hover_name='neighbourhood_group',
hover_data={'neighbourhood': False, 'neighbourhood_group': False, 'index': False, 'Ilość ofert w dzielnicy':True},
featureidkey='properties.index',
mapbox_style='carto-positron',
center={'lat': 41.3933, 'lon': 2.145},
zoom=10.85,
opacity=0.75)
fig.update_traces(
name='Średnia cena za nocleg', selector=(dict(type='choroplethmapbox'))
)
fig.update_layout(
height=800, width=1000,
# template='plotly_dark',
bargap=0.05,
title='Mapa dzielnic (mniejszych) Barcelony')
fig.show()
fig = px.choropleth_mapbox(gdf, geojson=geojson,
locations=gdf.index,
color='neighbourhood_group',
hover_name='neighbourhood_group',
hover_data={'neighbourhood': False, 'neighbourhood_group': False, 'index': False},
featureidkey='properties.index',
mapbox_style='carto-positron',
center={'lat': 41.3933, 'lon': 2.145},
zoom=10.85,
opacity=0.75)
fig.update_layout(
height=800, width=800,
# template='plotly_dark',
bargap=0.05,
title='Mapa dzielnic Barcelony')
fig.show()
# !pip install sweetviz -qq
# import sweetviz as sv
# import warnings
# warnings.filterwarnings('ignore')
# my_report = sv.analyze(df, pairwise_analysis="off")
# my_report.show_notebook()
fig = px.histogram(df, x='price', color_discrete_sequence=[px.colors.qualitative.Plotly[2]])
fig.update_xaxes(title_text='Cena za nocleg [$]')
fig.update_yaxes(title_text='Częstość')
fig.update_layout(title='Histogram cen za nocleg',
width=1100, height=400,)
grouped_amenities = df.groupby('license_cleansed')[list(amenities)].sum().transpose()
grouped_amenities['sum'] = grouped_amenities['Exempt'] + grouped_amenities['HUTB']
grouped_amenities = grouped_amenities.sort_values('sum', ascending=False)
fig = px.bar(grouped_amenities[['Exempt','HUTB']])
fig.update_layout(height=500, width=1100,
title='Ilość udogodnień',
xaxis_title="Udogodnienie",
yaxis_title="Ilość",
legend_title = 'Licencja')
fig = px.histogram(df, x='room_type', y='price', color='license_cleansed', barmode='group', histfunc='avg')
fig.update_layout(height=500, width=1100,
title='Cena za nocleg zależnie od rodzaju zakwaterowania i rodzaju licencji',
xaxis_title="Rodzaj zakwaterowania",
yaxis_title="Średnia cena za nocleg [$]",
legend_title = 'Licencja')
fig.update_xaxes(categoryorder='array', categoryarray=['Hotel room', 'Entire home/apt', 'Shared room', 'Private room'])
fig = px.box(df, x='room_type', y='price', color='license_cleansed', )
fig.update_traces(jitter=0.25, marker=dict(size=3))
fig.update_layout(height=500, width=1100,
title='Wykres pudełkowy ceny za nocleg w zależności od licencji i rodzaju zakwaterowania',
xaxis_title="Rodzaj zakwaterowania",
yaxis_title="Cena za nocleg [$]",
legend_title = 'Licencja')
fig
fig = px.box(df.query('price <=500'), x='room_type', y='price', color='license_cleansed', )
fig.update_traces(jitter=0.25, marker=dict(size=3))
fig.update_layout(height=500, width=1100,
title='Wykres pudełkowy ceny (do $500/noc) za nocleg w zależności od licencji i rodzaju zakwaterowania',
xaxis_title="Rodzaj zakwaterowania",
yaxis_title="Cena za nocleg [$]",
legend_title = 'Licencja')
fig
grouped_distr = pd.merge(
df.groupby('neighbourhood_group_cleansed')['price'].describe(),
df.groupby('neighbourhood_group_cleansed')['center_distance'].describe(),
on='neighbourhood_group_cleansed',
).sort_values(by='mean_x', ascending=False)
#grouped_distr
grouped_distr
| count_x | mean_x | std_x | min_x | 25%_x | 50%_x | 75%_x | max_x | count_y | mean_y | std_y | min_y | 25%_y | 50%_y | 75%_y | max_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| neighbourhood_group_cleansed | ||||||||||||||||
| Eixample | 6393.0 | 169.692320 | 150.755844 | 20.0 | 66.00 | 147.0 | 221.00 | 2000.0 | 6393.0 | 1492.811202 | 640.788493 | 23.451382 | 1039.855381 | 1453.586658 | 1991.372788 | 3112.951897 |
| Sarrià-Sant Gervasi | 880.0 | 149.497727 | 172.308934 | 20.0 | 59.00 | 122.5 | 183.00 | 1500.0 | 880.0 | 3694.034081 | 1252.484645 | 1921.810642 | 2876.642777 | 3371.226469 | 4139.976383 | 9868.081543 |
| Sant Martí | 1619.0 | 145.017912 | 159.309617 | 20.0 | 54.00 | 100.0 | 190.50 | 2000.0 | 1619.0 | 3751.391000 | 1077.494645 | 1600.817873 | 2874.949281 | 3568.740549 | 4404.874294 | 6868.509183 |
| Gràcia | 1556.0 | 142.798843 | 154.641592 | 20.0 | 52.00 | 120.5 | 180.00 | 1500.0 | 1556.0 | 2373.654962 | 654.081453 | 1566.727177 | 1902.758078 | 2194.347859 | 2636.622502 | 5682.112705 |
| Les Corts | 351.0 | 133.846154 | 106.328893 | 20.0 | 62.00 | 113.0 | 166.00 | 1000.0 | 351.0 | 4386.130585 | 926.871102 | 2984.518282 | 3615.895271 | 4310.096066 | 4992.869648 | 7229.699586 |
| Ciutat Vella | 4188.0 | 124.504059 | 133.453096 | 20.0 | 51.00 | 80.0 | 150.25 | 1600.0 | 4188.0 | 1095.661683 | 517.282337 | 132.004175 | 761.616842 | 1032.218828 | 1291.133506 | 2979.583787 |
| Sants-Montjuïc | 1840.0 | 117.895109 | 125.607801 | 20.0 | 50.00 | 80.5 | 157.25 | 1790.0 | 1840.0 | 2983.038600 | 1235.827991 | 1334.991640 | 1738.282970 | 3030.268516 | 4056.355592 | 5411.672584 |
| Horta-Guinardó | 519.0 | 93.630058 | 134.332088 | 20.0 | 40.00 | 60.0 | 104.00 | 1800.0 | 519.0 | 3690.854151 | 932.979459 | 2251.691936 | 2919.209270 | 3550.261859 | 4219.209881 | 6883.726557 |
| Sant Andreu | 248.0 | 75.838710 | 72.944867 | 20.0 | 34.75 | 55.0 | 85.25 | 700.0 | 248.0 | 4897.537865 | 1047.854364 | 3622.339033 | 4037.163865 | 4516.946084 | 5767.867382 | 7900.081680 |
| Nou Barris | 189.0 | 73.206349 | 122.095423 | 20.0 | 30.00 | 50.0 | 75.00 | 1533.0 | 189.0 | 5550.214267 | 852.126785 | 4399.795130 | 4836.235167 | 5266.269047 | 6241.164125 | 8104.114307 |
pcr_df = df.groupby(['neighbourhood_group_cleansed', 'license_cleansed'])[['price', 'center_distance', 'review_scores_rating']].mean().reset_index()
count_ddf = df.groupby(['neighbourhood_group_cleansed', 'license_cleansed'])[['price', 'center_distance', 'review_scores_rating']].count().reset_index()
dist_count_order = df.groupby('neighbourhood_group_cleansed')['id'].count().sort_values(ascending=False).index.tolist()
# dist_count_order
fig = px.bar(count_ddf, x='neighbourhood_group_cleansed', y='price', color='license_cleansed')
fig.update_layout(title='Ilość ofert w danej dzielnicy',
width=1100, height=450,
xaxis_title='',
yaxis_title= "Ilość ofert",
legend_title = 'Licencja'
)
fig.update_xaxes(categoryorder='array', categoryarray=dist_count_order)
fig
fig = px.bar(pcr_df, x='neighbourhood_group_cleansed', y='review_scores_rating', color='license_cleansed', barmode='group')
fig.update_layout(title='Średnia ocena zależnie od licencji dla danej dzielnicy',
width=1100, height=450,
xaxis_title='',
yaxis_title= "Średnia ocena",
legend_title = 'Licencja'
)
fig.update_xaxes(categoryorder='array', categoryarray=['Hotel room', 'Entire home/apt', 'Shared room', 'Private room'])
fig.show()
fig = px.box(df.sort_values(by='license_cleansed'), x='neighbourhood_group_cleansed', y='review_scores_rating', color='license_cleansed')
fig.update_layout(title='Oceny zależnie od licencji i dzielnicy',
width=1100, height=450,
xaxis_title='',
yaxis_title= "Ocena",
legend_title = 'Licencja'
)
fig.update_xaxes(categoryorder='array', categoryarray=['Hotel room', 'Entire home/apt', 'Shared room', 'Private room'])
fig.show()
fig = px.histogram(df, x='center_distance', color_discrete_sequence=[px.colors.qualitative.Plotly[4]])
fig.update_xaxes(title_text='Dystans do centrum [m]')
fig.update_yaxes(title_text='Częstość')
fig.update_layout(title='Histogram odległości do centrum Barcelony (Plaza de Cataluña)',
width=1100, height=400,)
#@title Ceny i dystans w dzielnicach
lines = [
'Zrobienie one-hot encoding dla dzielnic',
'będzie przydatne. Jest duża zależność',
'między ceną a dzielnicą i nie jest ona',
'jednoznaczna z odległością do centrum.',
'',
'np. Ciutat Vella ma najniższe ceny, ale',
'jej oferty są najbliżej centrum.'
]
text = '<br>'.join(lines) ### wrap the annotation text for plotly
trace1 = go.Bar(x=grouped_distr.index, y=grouped_distr['mean_x'], ### composite bar chart
name='Średnia cena za nocleg [$]',
marker=dict(color='#00CC96'))
trace2 = go.Bar(x=grouped_distr.index, y=grouped_distr['mean_y'],
name='Średni dystans do centrum [m]',
marker=dict(color='#FFA15A'))
fig = make_subplots(rows=2, cols=1)
fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)
fig.update_layout(title='Średnia cena za nocleg i średni dystans do centrum dla ofert z danej dzielnicy Barcelony',
width=1100, height=700,
margin=dict(r=290))
fig.update_yaxes(title_text='Średnia cena za nocleg [$]', row=1, col=1)
fig.update_yaxes(title_text='Średni dystans do centrum [m]', row=2, col=1)
fig.add_annotation(dict(font=dict(size=12,),
width=350,
x=1.014,
y=0.90,
showarrow=False,
text=text,
align='left',
xanchor='left',
xref="paper",
yref="paper"))
fig.show()
#@title Ceny i dystans w dzielnicach
lines = [
'Zrobienie one-hot encoding dla dzielnic',
'będzie przydatne. Jest duża zależność',
'między ceną a dzielnicą i nie jest ona',
'jednoznaczna z odległością do centrum.',
'',
'np. Ciutat Vella ma najniższe ceny, ale',
'jej oferty są najbliżej centrum.'
]
text = '<br>'.join(lines) ### wrap the annotation text for plotly
trace1 = go.Box(x=df['neighbourhood_group_cleansed'], y=df.query('price <=500')['price'], ### composite box plots
name='Średnia cena za nocleg [$]',
marker=dict(color='#00CC96'))
trace2 = go.Box(x=df['neighbourhood_group_cleansed'], y=df['center_distance'],
name='Średni dystans do centrum [m]',
marker=dict(color='#FFA15A'))
fig = make_subplots(rows=2, cols=1)
fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)
fig.update_traces(jitter=0.25, marker=dict(size=2.5))
fig.update_layout(title='Ceny za nocleg (do $500 / noc) i dystans do centrum dla ofert z danej dzielnicy Barcelony',
width=1100, height=700,
margin=dict(r=290))
fig.update_yaxes(title_text='Cena za nocleg [$]', row=1, col=1)
fig.update_yaxes(title_text='Średni dystans do centrum [m]', row=2, col=1)
fig.update_xaxes(categoryorder='array', categoryarray=grouped_distr.index.tolist())
fig.add_annotation(dict(font=dict(size=12,),
width=350,
x=1.014,
y=0.90,
showarrow=False,
text=text,
align='left',
xanchor='left',
xref="paper",
yref="paper"))
fig.show()
#@title Ceny i dystans w dzielnicach
lines = [
'Zrobienie one-hot encoding dla dzielnic',
'będzie przydatne. Jest duża zależność',
'między ceną a dzielnicą i nie jest ona',
'jednoznaczna z odległością do centrum.',
'',
'np. Ciutat Vella ma najniższe ceny, ale',
'jej oferty są najbliżej centrum.'
]
text = '<br>'.join(lines) ### wrap the annotation text for plotly
trace1 = go.Box(x=df['neighbourhood_group_cleansed'], y=df['price'], ### composite box plots
name='Średnia cena za nocleg [$]',
marker=dict(color='#00CC96'))
trace2 = go.Box(x=df['neighbourhood_group_cleansed'], y=df['center_distance'],
name='Średni dystans do centrum [m]',
marker=dict(color='#FFA15A'))
fig = make_subplots(rows=2, cols=1)
fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)
fig.update_layout(title='Ceny za nocleg i dystans do centrum dla ofert z danej dzielnicy Barcelony',
width=1100, height=700,
margin=dict(r=290))
fig.update_yaxes(title_text='Średnia cena za nocleg [$]', row=1, col=1)
fig.update_yaxes(title_text='Średni dystans do centrum [m]', row=2, col=1)
fig.update_xaxes(categoryorder='array', categoryarray=grouped_distr.index.tolist())
fig.add_annotation(dict(font=dict(size=12,),
width=350,
x=1.014,
y=0.90,
showarrow=False,
text=text,
align='left',
xanchor='left',
xref="paper",
yref="paper"))
fig.show()
Because kaleido is not working for some reason and I cannot render the plots as .svg or .png, the plots cause a fairly large slowdown of the computer. Below is a toggle for whether the plots should be displayed.
show_large_plots = True
review_cols = [c for c in df.columns if 'review' in c]
#@title Recenzje - histogramy
if show_large_plots:
fig = make_subplots(rows=4, cols=4)
fig.add_trace(go.Histogram(x=df['number_of_reviews'] , name=review_cols[0] ), row=1, col=1, )
fig.add_trace(go.Histogram(x=df['number_of_reviews_ltm'] , name=review_cols[1] ), row=1, col=2, )
fig.add_trace(go.Histogram(x=df['number_of_reviews_l30d'] , name=review_cols[2] ), row=1, col=3, )
fig.add_trace(go.Histogram(x=df['first_review'] , name=review_cols[3] ), row=1, col=4, )
fig.add_trace(go.Histogram(x=df['last_review'] , name=review_cols[4] ), row=2, col=1, )
fig.add_trace(go.Histogram(x=df['review_scores_rating'] , name=review_cols[5] ), row=2, col=2, )
fig.add_trace(go.Histogram(x=df['review_scores_accuracy'] , name=review_cols[6] ), row=2, col=3, )
fig.add_trace(go.Histogram(x=df['review_scores_cleanliness'] , name=review_cols[7] ), row=2, col=4, )
fig.add_trace(go.Histogram(x=df['review_scores_checkin'] , name=review_cols[8] ), row=3, col=1, )
fig.add_trace(go.Histogram(x=df['review_scores_communication'] , name=review_cols[9] ), row=3, col=2, )
fig.add_trace(go.Histogram(x=df['review_scores_location'] , name=review_cols[10] ), row=3, col=3, )
fig.add_trace(go.Histogram(x=df['review_scores_value'] , name=review_cols[11] ), row=3, col=4, )
fig.add_trace(go.Histogram(x=df['reviews_per_month'] , name=review_cols[12] ), row=4, col=1, )
fig.update_xaxes(title_text="Number of Reviews" , row=1, col=1, title_standoff=0)
fig.update_xaxes(title_text="Number of Reviews LTM" , row=1, col=2, title_standoff=0)
fig.update_xaxes(title_text="Number of Reviews L30D" , row=1, col=3, title_standoff=0)
fig.update_xaxes(title_text="First Review" , row=1, col=4, title_standoff=0)
fig.update_xaxes(title_text="Last Review" , row=2, col=1, title_standoff=0)
fig.update_xaxes(title_text="Review Scores Rating" , row=2, col=2, title_standoff=0)
fig.update_xaxes(title_text="Review Scores Accuracy" , row=2, col=3, title_standoff=0)
fig.update_xaxes(title_text="Review Scores Cleanliness" , row=2, col=4, title_standoff=0)
fig.update_xaxes(title_text="Review Scores Checkin" , row=3, col=1, title_standoff=0)
fig.update_xaxes(title_text="Review Scores Communication" , row=3, col=2, title_standoff=0)
fig.update_xaxes(title_text="Review Scores Location" , row=3, col=3, title_standoff=0)
fig.update_xaxes(title_text="Review Scores Value" , row=3, col=4, title_standoff=0)
fig.update_xaxes(title_text="Reviews per Month" , row=4, col=1, title_standoff=0)
fig.update_layout(title='Histogramy wartości dotyczących recenzji',
margin=dict(l=50, r=50, t=70, b=70),
height=900)
fig.show()
#@title Recenzje a cena
if show_large_plots:
fig = make_subplots(
rows=4, cols=4,
specs=[
[{}, {}, {}, {}],
[{}, {}, {}, {}],
[{}, {}, {}, {}],
[{}, {"rowspan":1, "colspan": 3}, None, None,]
],
print_grid=False)
# price vs. each review-related column, filling the grid row by row
review_plot_cols = ['number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
                    'first_review', 'last_review', 'review_scores_rating',
                    'review_scores_accuracy', 'review_scores_cleanliness',
                    'review_scores_checkin', 'review_scores_communication',
                    'review_scores_location', 'review_scores_value', 'reviews_per_month']
axis_titles = ['Number of Reviews', 'Number of Reviews LTM', 'Number of Reviews L30D',
               'First Review', 'Last Review', 'Review Scores Rating',
               'Review Scores Accuracy', 'Review Scores Cleanliness',
               'Review Scores Checkin', 'Review Scores Communication',
               'Review Scores Location', 'Review Scores Value', 'Reviews per Month']
for i, (col, title) in enumerate(zip(review_plot_cols, axis_titles)):
    r, c = i // 4 + 1, i % 4 + 1
    # pin the first palette colour for the last scatter, as in the original layout
    marker = dict(color='#636EFA') if col == 'reviews_per_month' else None
    fig.add_trace(go.Scatter(x=df[col], y=df['price'], mode='markers',
                             name=review_cols[i], marker=marker),
                  row=r, col=c)
    fig.update_xaxes(title_text=title, row=r, col=c, title_standoff=0)
# price histogram in the wide cell at the bottom of the grid
fig.add_trace(go.Histogram(x=df['price'], marker=dict(color='#00CC96'), name='price'),
              row=4, col=2)
fig.update_xaxes(title_text='Price per night', row=4, col=2, title_standoff=0)
fig.update_yaxes(title_text='Frequency', row=4, col=2, title_standoff=0)
fig.update_layout(title='Price vs. review-related features',
margin=dict(l=50, r=50, t=70, b=70),
height=900)
fig.show()
This part goes beyond the scope of the assignment, but it will be useful for the portfolio.
XGBoost will be used for modelling, chosen for its reliability, with SHAP for interpreting the results.
model_cols = corr_df.columns
model_cols
Index(['host_response_time', 'host_response_rate', 'host_acceptance_rate',
'host_is_superhost', 'host_listings_count', 'host_total_listings_count',
'host_has_profile_pic', 'host_identity_verified', 'latitude',
'longitude', 'accommodates', 'bedrooms', 'beds', 'price',
'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
'maximum_minimum_nights', 'minimum_maximum_nights',
'maximum_maximum_nights', 'minimum_nights_avg_ntm',
'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
'availability_60', 'availability_90', 'availability_365',
'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'instant_bookable',
'calculated_host_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'calculated_host_listings_count_shared_rooms', 'reviews_per_month',
'private_bathrooms', 'shared_bathrooms', 'total_bathrooms',
'refrigerator', 'tv', 'iron', 'crib', 'parking', 'elevator',
'AC/heating', 'wifi', 'kitchen', 'washer', 'Exempt', 'HUTB',
'Entire home/apt', 'Hotel room', 'Private room', 'Shared room',
'center_distance'],
dtype='object')
df[model_cols].isna().sum()[:10].sort_values(ascending=False)
host_response_time           2812
host_response_rate           2812
host_acceptance_rate         2535
host_is_superhost             418
host_listings_count             2
host_total_listings_count       2
host_has_profile_pic            2
host_identity_verified          2
latitude                        0
longitude                       0
dtype: int64
df['host_is_superhost'] = df['host_is_superhost'].fillna(0)  # inplace fillna on a column slice is unreliable in newer pandas
districts = df['neighbourhood_group_cleansed'].unique().tolist()
model_cols = model_cols.tolist()
# columns excluded from the model's feature set
drop_cols = ['host_acceptance_rate', 'latitude', 'longitude',
             'review_scores_rating', 'review_scores_accuracy',
             'review_scores_cleanliness', 'review_scores_checkin',
             'review_scores_communication', 'review_scores_location',
             'review_scores_value', 'reviews_per_month', 'Exempt']
model_cols = [c for c in model_cols if c not in drop_cols]
model_cols.extend(districts)  # add the district columns
print('Model df shape:', df[model_cols].shape)
print('Model df shape:', df[model_cols].dropna().shape)
Model df shape: (17783, 61)
Model df shape: (14971, 61)
for i in model_cols:
print(
i,
df[i].isna().sum()
)
host_response_time 2812
host_response_rate 2812
host_is_superhost 0
host_listings_count 2
host_total_listings_count 2
host_has_profile_pic 2
host_identity_verified 2
accommodates 0
bedrooms 0
beds 0
price 0
minimum_nights 0
maximum_nights 0
minimum_minimum_nights 0
maximum_minimum_nights 0
minimum_maximum_nights 0
maximum_maximum_nights 0
minimum_nights_avg_ntm 0
maximum_nights_avg_ntm 0
has_availability 0
availability_30 0
availability_60 0
availability_90 0
availability_365 0
number_of_reviews 0
number_of_reviews_ltm 0
number_of_reviews_l30d 0
instant_bookable 0
calculated_host_listings_count 0
calculated_host_listings_count_entire_homes 0
calculated_host_listings_count_private_rooms 0
calculated_host_listings_count_shared_rooms 0
private_bathrooms 0
shared_bathrooms 0
total_bathrooms 0
refrigerator 0
tv 0
iron 0
crib 0
parking 0
elevator 0
AC/heating 0
wifi 0
kitchen 0
washer 0
HUTB 0
Entire home/apt 0
Hotel room 0
Private room 0
Shared room 0
center_distance 0
Eixample 0
Sant Martí 0
Gràcia 0
Sants-Montjuïc 0
Ciutat Vella 0
Les Corts 0
Sarrià-Sant Gervasi 0
Sant Andreu 0
Horta-Guinardó 0
Nou Barris 0
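The assignment puts weight on locating and summarising missing values, so a small reusable helper can replace ad-hoc loops like the one above. This is a sketch that is not part of the original notebook; the name `missing_summary` is made up:

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, largest first."""
    counts = df.isna().sum()
    out = pd.DataFrame({
        'n_missing': counts,
        'pct_missing': (counts / len(df) * 100).round(2),
    })
    # keep only columns that actually contain NaNs
    return out[out['n_missing'] > 0].sort_values('n_missing', ascending=False)

# tiny demonstration on made-up data
demo = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [1, 2, 3], 'c': [None, None, 3.0]})
summary = missing_summary(demo)
```

Applied to `df[model_cols]`, it would report only `host_response_time`, `host_response_rate` and the handful of host columns, which is exactly the decision input for imputing versus dropping.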
!pip install shap -qq
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import lasso_path
import xgboost
import shap
df = df[model_cols].dropna()
df = df.query('price <= 500')
X = df.drop('price', axis=1)
X = preprocessing.scale(X)
y = df['price']
alphas_lasso, coefs_lasso, _ = lasso_path(X, y)
traces = []
for i in range(X.shape[1]):
trace = go.Scatter(
x = alphas_lasso,
y = coefs_lasso[i],
mode = 'lines',
name = df.drop('price', axis=1).columns[i]
)
traces.append(trace)
layout = go.Layout(
title = 'Lasso paths - Plotly',
xaxis = dict(title = 'Log($\\lambda$)'),
yaxis = dict(title = 'coefficients'),
width=1400, height=1400, template='plotly_dark',
legend = dict(orientation = 'h')
)
fig = go.Figure(data = traces, layout = layout)
fig.update_xaxes(type = 'log')
fig.show()
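To make the lasso paths above easier to read, it can help to see `lasso_path` on a toy problem where the answer is known: with one informative feature, its coefficient should survive longest as the regularisation strength grows. This is an illustrative sketch on synthetic data, not the notebook's dataset:

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
X = scale(rng.normal(size=(200, 5)))            # standardised, as in the cell above
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters

alphas, coefs, _ = lasso_path(X, y)
# alphas come back in decreasing order: at the largest alpha every coefficient
# is shrunk exactly to zero; at the smallest, feature 0 recovers a value near 3
```

Standardising first matters because the lasso penalises all coefficients with the same alpha, so features on larger scales would otherwise be penalised less.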
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
import xgboost as xgb
xgb_reg = xgb.XGBRegressor()
params = {
    'objective': ['reg:squarederror'],  # 'reg:linear' is a deprecated alias of 'reg:squarederror', so searching over both is redundant
'max_depth': [7, 8],
'learning_rate': [0.1, 0.01],
'n_estimators': [150,300,450]
}
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(xgb_reg, params, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
model = grid_search.best_estimator_
print('Best hyperparameters:', grid_search.best_params_)
Best hyperparameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 450, 'objective': 'reg:squarederror'}
predicted = model.predict(X_test)
graph_pred = predicted.copy()
fig = px.histogram(y_test - graph_pred, nbins=100)
fig.update_layout(
    width=780, height=625, template='plotly_white', bargap=0.05,
    showlegend=False,
    xaxis_title="Residuals (actual minus predicted price)",
    yaxis_title="Frequency")
fig.add_vline(x=0)
import plotly.graph_objects as go
trace2 = go.Scatter(
x=y_test,
y=graph_pred,
mode='markers',
name='Scatter Plot')
trace1 = go.Scatter(
x=y_test,
y=y_test,
mode='lines',
name='Line Plot')
fig = go.Figure()
fig.add_trace(trace2)
fig.add_trace(trace1)
fig.update_layout(
width=540, height=523, margin=dict(l=30, r=30, t=30, b=30),
showlegend=False,
    xaxis_title="Actual price",
    yaxis_title="Predicted price",
# xaxis_range=[0, 2500000], yaxis_range=[0, 2500000]
)
fig.show()
from sklearn import metrics
MAE = metrics.mean_absolute_error(y_test, predicted)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, predicted))
Score = metrics.explained_variance_score(y_test, predicted)
print(f'MAE: {MAE}')
print(f'RMSE: {RMSE}')
print(f'Score: {Score}')
MAE: 25.398717930093763
RMSE: 40.3322784967719
Score: 0.8119994928762382
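The gap between the MAE (~25) and the RMSE (~40) above indicates an uneven error distribution, since RMSE squares errors before averaging and so weights large misses more heavily. A toy example with made-up prices (not the model's predictions) makes the relationship concrete:

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0])   # hypothetical nightly prices
y_pred = np.array([110.0, 140.0, 220.0])   # hypothetical predictions

mae = np.mean(np.abs(y_true - y_pred))           # (10 + 10 + 20) / 3
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # sqrt((100 + 100 + 400) / 3)
# RMSE >= MAE always; the gap widens as individual errors become more uneven
```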
shap.initjs()
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
[13:19:04] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
# quoted from the SHAP documentation
'''From official documentation: The summary plot combines feature importance with feature effects.
Each point on the summary plot is a Shapley value for a feature and an instance.
The position on the y-axis is determined by the feature and on the x-axis by the Shapley value.
The color represents the value of the feature from low to high.
Overlapping points are jittered in y-axis direction, so we get a sense of the distribution
of the Shapley values per feature. The features are ordered according to their importance.'''
shap.summary_plot(shap_values, features=X, feature_names=X.columns)
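The quoted description rests on the additivity of Shapley values: the base value plus the per-feature contributions reconstructs each model output. For a linear model with independent features this identity can be checked exactly with plain NumPy; the sketch below is an illustration of the property, not a call into SHAP itself, and all values are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w, b = np.array([2.0, -1.0, 0.5]), 10.0
f = X @ w + b                       # the "model" output
base = f.mean()                     # SHAP base value: the average prediction E[f(X)]
phi = (X - X.mean(axis=0)) * w      # exact per-feature SHAP values for a linear model
# additivity: base value + summed contributions reconstructs every prediction
```

`TreeExplainer` computes the analogous decomposition for the XGBoost ensemble above, which is what the summary plot visualises per feature.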
cols = ['HUTB', 'accommodates', 'private_bathrooms', 'center_distance',
'minimum_nights_avg_ntm', 'maximum_minimum_nights', 'crib', 'calculated_host_listings_count_entire_homes',
'host_is_superhost', 'minimum_minimum_nights', 'host_listings_count', 'Entire home/apt', 'availability_30',
'minimum_maximum_nights', 'calculated_host_listings_count_private_rooms', 'tv', 'bedrooms', 'host_total_listings_count' ,
'calculated_host_listings_count', 'availability_60']
# # force_plot raised an error here, caused by version conflicts between packages
# # X = (pd.DataFrame.from_dict([wlochy_test, ]))
# shap_values = explainer.shap_values(X_test.sample()[cols])
# shap.force_plot(explainer.expected_value, shap_values, X, matplotlib=True)
import pickle
pickle.dump(model, open('Barc XGB.pkl', "wb"))
# xgb_model_loaded = pickle.load(open(file_name, "rb"))
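A minimal round-trip sketch of the pickle step, using a stand-in dictionary instead of the trained XGBRegressor (the file name and contents here are made up):

```python
import os
import pickle
import tempfile

# hypothetical stand-in for the trained model object
model_stub = {'best_params': {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 450}}

path = os.path.join(tempfile.gettempdir(), 'barc_xgb_demo.pkl')
with open(path, 'wb') as fh:
    pickle.dump(model_stub, fh)
with open(path, 'rb') as fh:
    loaded = pickle.load(fh)
# the loaded object is equal to what was saved
```

For the XGBoost model specifically, `model.save_model('model.json')` is the format the warning printed earlier recommends over the deprecated binary/pickle route.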